In this video, we'll look at two further refinements to the reinforcement learning algorithm you've seen. The first idea is called using mini-batches. This turns out to be an idea that can speed up your reinforcement learning algorithm, and it's also applicable to supervised learning and can speed up your supervised learning algorithm as well, like training a neural network or training a linear regression or logistic regression model. The second idea we'll look at is soft updates, which it turns out will help your reinforcement learning algorithm do a better job of converging to a good solution. Let's take a look at mini-batches and soft updates.

To understand mini-batches, let's just look at supervised learning to start. Here's a dataset of housing sizes and prices that you had seen way back in the first course of the specialization on using linear regression to predict housing prices. There, we had come up with this cost function for the parameters w and b: it was 1 over 2m times the sum of the squared differences between the prediction and the actual value y. The gradient descent algorithm was to repeatedly update w as w minus the learning rate alpha times the partial derivative with respect to w of the cost J of w, b, and similarly to update b as follows. Let me just take this definition of J of w, b and substitute it in here.

Now, when we looked at this example way back when we were starting to talk about linear regression and supervised learning, the training set size m was pretty small. I think we had 47 training examples. But what if you have a very, very large training set, say m equals 100 million? There are many countries, including the United States, with over 100 million housing units, and so a national census would give you a dataset of this order of magnitude. The problem with this algorithm when your dataset is this big is that every single step of gradient descent requires computing this average over 100 million examples, and this turns out to be very slow. Every step of gradient descent means you would compute this sum, or this average, over 100 million examples. Then you take one tiny gradient descent step and you go back and have to scan over your entire 100 million example dataset again to compute the derivative for the next step. Then you take another tiny gradient descent step, and so on and so on. When the training set size is very large, this gradient descent algorithm turns out to be quite slow.

The idea of mini-batch gradient descent is to not use all 100 million training examples on every single iteration through this loop. Instead, we may pick a smaller number, let me call it m prime, equal to say 1,000, and on every step, instead of using all 100 million examples, we will pick some subset of 1,000, or m prime, examples. This inner term becomes 1 over 2 m prime times the sum over some m prime examples. Now, each iteration through gradient descent requires looking only at 1,000 rather than 100 million examples, so every step takes much less time, and this leads to a more efficient algorithm. What mini-batch gradient descent does is, on the first iteration through the algorithm, it looks at one subset of the data. On the next iteration, it looks at a different subset of the data, and so on for the third iteration and so on, so that every iteration is looking at just a subset of the data and runs much more quickly.
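To make the mini-batch idea concrete before we look at why it's reasonable, here is a minimal sketch in Python of what one mini-batch gradient descent step might look like for the linear regression example; the variable names and the use of NumPy are illustrative assumptions rather than code from the course.

```python
import numpy as np

def mini_batch_gd_step(X, y, w, b, alpha=0.01, m_prime=1000):
    """One gradient descent step for f(x) = w*x + b using only m_prime random examples.

    X: array of housing sizes, y: array of prices (both of length m, e.g. 100 million).
    """
    m = X.shape[0]
    # Pick a random subset of m_prime examples instead of scanning all m examples
    idx = np.random.choice(m, size=min(m_prime, m), replace=False)
    X_batch, y_batch = X[idx], y[idx]

    # Prediction errors on the mini-batch only
    err = (w * X_batch + b) - y_batch

    # Gradients of J = 1/(2*m') * sum(err^2) with respect to w and b
    dj_dw = np.mean(err * X_batch)
    dj_db = np.mean(err)

    # Take one (noisy but cheap) gradient descent step
    return w - alpha * dj_dw, b - alpha * dj_db
```

Each call touches only m prime examples, so a single step stays cheap even when m is 100 million; repeating this step many times traces out the noisier but much faster path toward the minimum described next.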
To see why this might be a reasonable algorithm, here's the housing dataset. If on the first iteration we were to look at just, say, five examples, this is not the whole dataset, but it's somewhat representative of the straight line you might want to fit in the end, so taking one gradient descent step to make the algorithm better fit these five examples is okay. But then on the next iteration, you take a different five examples, like the ones shown here. You take one gradient descent step using these five examples, and on the next iteration you use a different five examples, and so on and so forth. You could scan through this list of examples from top to bottom; that would be one way. Another way would be, on every single iteration, to just pick a totally different five examples to use.

You might remember with batch gradient descent, if these are the contours of the cost function J, then batch gradient descent would say start here and take a step, take a step, take a step, take a step, take a step. Every step of gradient descent causes the parameters to reliably get closer to the global minimum of the cost function here in the middle. In contrast, mini-batch gradient descent, or the mini-batch learning algorithm, will do something like this. If you start here, then the first iteration uses just five examples, so it'll kind of head in the right direction, but maybe not the best gradient descent direction. Then the next iteration it may do that, the next iteration that, and that, and that, and sometimes, just by chance, the five examples you chose may be an unlucky choice and even head in the wrong direction, away from the global minimum, and so on and so forth. But on average, mini-batch gradient descent will tend towards the global minimum, not reliably and somewhat noisily, but every iteration is computationally much cheaper. So mini-batch learning, or mini-batch gradient descent, turns out to be a much faster algorithm when you have a very large training set. In fact, for supervised learning, when you have a very large training set, mini-batch learning, or mini-batch gradient descent, or the mini-batch version of other optimization algorithms like Adam, is used more commonly than batch gradient descent.

Going back to our reinforcement learning algorithm, this is the algorithm that we had seen previously. The mini-batch version of this would be: even if you have stored the 10,000 most recent tuples in the replay buffer, you might choose not to use all 10,000 every time you train a model. Instead, what you might do is just take a subset. You might choose just 1,000 of these (S, A, R(S), S') tuples and use them to create just 1,000 training examples to train the neural network. It turns out that this will make each iteration of training a model a little bit more noisy, but much faster, and this will overall tend to speed up this reinforcement learning algorithm. So that's how mini-batching can speed up both a supervised learning algorithm like linear regression, as well as this reinforcement learning algorithm, where you may use a mini-batch size of, say, 1,000 examples, even if you've stored away 10,000 of these tuples in your replay buffer.
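Applied to the reinforcement learning algorithm, the same idea might look roughly like the sketch below. It assumes the replay buffer is a Python list of the 10,000 most recent (S, A, R(S), S') tuples and that q_network(s) returns the current estimates of Q(s, a) for every action a; the function and parameter names are illustrative assumptions, not the actual lab code.

```python
import random
import numpy as np

def sample_training_batch(replay_buffer, q_network, gamma=0.995, batch_size=1000):
    """Build mini-batch training examples (x, y) for the Q network from the replay buffer.

    replay_buffer: list of (s, a, reward, s_prime) tuples, e.g. the 10,000 most recent.
    q_network(s): assumed to return an array of estimated Q(s, a) values, one per action a.
    """
    # Use only a random subset of the stored tuples on each training iteration
    mini_batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))

    xs, ys = [], []
    for s, a, reward, s_prime in mini_batch:
        # Input: the state-action pair (one common choice of network input)
        xs.append(np.concatenate([np.ravel(s), np.atleast_1d(a)]))
        # Target: y = R(s) + gamma * max over a' of Q(s', a')
        ys.append(reward + gamma * np.max(q_network(s_prime)))
    return np.array(xs), np.array(ys)
```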
Finally, there's one other refinement to the algorithm that can make it converge more reliably, which is this step here where we set Q equals Q new. It turns out that this can make a very abrupt change to Q. If you train a new neural network Q new, maybe just by chance it's not a very good neural network; maybe it's even a little bit worse than the old one. Then you've just overwritten your Q function with a potentially worse, noisy neural network. The soft update method helps prevent Q from getting worse because of just one unlucky step.

In particular, the neural network Q will have some parameters W and B, all the parameters for all the layers of the neural network. When you train the new neural network, you get some parameters W new and B new. In the original algorithm as described on that slide, you would set W to be equal to W new and B equal to B new; that's what set Q equals Q new means. With the soft update, what we do instead is set W equals 0.01 times W new plus 0.99 times W. In other words, we're going to make W be 99 percent the old version of W plus 1 percent of the new version W new. This is called a soft update because whenever we train a new neural network W new, we only accept a little bit of the new value. Similarly, B equals 0.01 times B new plus 0.99 times B. These numbers, 0.01 and 0.99, are hyperparameters that you can set, and they control how aggressively you move W towards W new; these two numbers are expected to add up to one. One extreme would be if you were to set W equals 1 times W new plus 0 times W, in which case you're back to the original algorithm up here, where you're just copying W new onto W. But the soft update allows you to make a more gradual change to Q, or to the neural network parameters W and B that determine your current guess for the Q function Q of S, A. It turns out that using the soft update method causes the reinforcement learning algorithm to converge more reliably. It makes it less likely that the reinforcement learning algorithm will oscillate or diverge or have other undesirable properties.

With these two final refinements to the algorithm, mini-batching, which actually applies very well to supervised learning as well, not just reinforcement learning, as well as the idea of soft updates, you should be able to get your learning algorithm to work really well on the Lunar Lander. The Lunar Lander is actually a decently complex, decently challenging application, and so if you can get it to work and land safely on the moon, I think that's actually really cool, and I hope you enjoy playing with the practice lab. Now, we've talked a lot about reinforcement learning. Before we wrap up, I'd like to share with you my thoughts on the state of reinforcement learning, so that as you go out and build applications using different machine learning techniques, be it supervised, unsupervised, or reinforcement learning techniques, you have a framework for understanding where reinforcement learning fits into the world of machine learning today. So let's go take a look at that in the next video.
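As a quick recap of the soft update in code form, here is a minimal sketch. It assumes each network's parameters are stored as a dictionary of NumPy arrays (in a framework like TensorFlow you would blend the corresponding weight tensors instead), with tau = 0.01 playing the role of the 0.01 / 0.99 split described above.

```python
def soft_update(params, new_params, tau=0.01):
    """Blend newly trained parameters into the current Q network's parameters.

    params, new_params: dicts of arrays, e.g. {"W1": ..., "b1": ..., "W2": ..., ...}.
    tau = 1.0 would copy the new parameters outright (the original "set Q = Q_new");
    a small tau such as 0.01 changes Q only gradually, which helps convergence.
    """
    return {name: tau * new_params[name] + (1.0 - tau) * params[name]
            for name in params}
```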