Let's see how we can use reinforcement learning to control the lunar lander, or for other reinforcement learning problems. The key idea is that we're going to train a neural network to compute, or approximate, the state-action value function Q of S, A, and that in turn will let us pick good actions. Let's see how this works.

The heart of the learning algorithm is that we're going to train a neural network that inputs the current state and the current action and computes, or approximates, Q of S, A. In particular, for the lunar lander, we're going to take the state S and any action A and put them together. Recall that the state was the list of eight numbers we saw previously: X, Y, X dot, Y dot, theta, theta dot, and then L and R for whether the left and right legs are grounded. So that's a list of eight numbers describing the state. Then we have four possible actions: nothing, left, main engine, and right. We can encode any of these four actions using a one-hot feature vector. If the action were the first action, we would encode it as 1, 0, 0, 0; if it were the second action, firing the left thruster, we would encode it as 0, 1, 0, 0. This list of 12 numbers, eight numbers for the state and four numbers for the one-hot encoding of the action, is the input to the neural network, and I'm going to call it X.

We'll then take these 12 numbers and feed them to a neural network with, say, 64 units in the first hidden layer, 64 units in the second hidden layer, and a single unit in the output layer. The job of the neural network is to output Q of S, A, the state-action value function for the lunar lander, given the input S and A. Because we'll be using neural network training algorithms in a little bit, I'm also going to refer to this value Q of S, A as the target value Y that we'll train the neural network to approximate.

Notice that I did say reinforcement learning is different from supervised learning. What we're going to do is not input a state and have the network output an action; we're going to input a state-action pair and have it try to output Q of S, A. Using a neural network inside the reinforcement learning algorithm this way turns out to work pretty well. We'll see the details in a little bit, so don't worry if it doesn't all make sense yet.

If you can train a neural network, with appropriate choices of parameters in the hidden layers and the output layer, to give you good estimates of Q of S, A, then whenever your lunar lander is in some state S, you can use the neural network to compute Q of S, A for all four actions: Q of S, nothing; Q of S, left; Q of S, main; and Q of S, right. Then, whichever of these has the highest value, you pick the corresponding action A. For example, if out of these four values Q of S, main is largest, you would decide to fire the main engine of the lunar lander.

So the question becomes: how do you train a neural network to output Q of S, A? The approach will be to use Bellman's equation to create a training set with lots of examples X and Y, and then use supervised learning, exactly as you learned in the second course when we talked about neural networks, to learn a mapping from X to Y, that is, a mapping from the state-action pair to this target value Q of S, A. But how do you get a training set with values for X and Y that you can then train a neural network on? Let's take a look.
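As a concrete illustration of the architecture just described, here is a minimal sketch of a Q-network with 12 inputs, two hidden layers of 64 units, and one output, plus a helper that picks an action by evaluating Q of S, A for all four one-hot actions. The use of TensorFlow/Keras and the function names here are assumptions for illustration, not the course's reference implementation.

```python
# Sketch only: framework choice (TensorFlow/Keras) and names are assumptions.
import numpy as np
import tensorflow as tf

NUM_STATE_VARS = 8   # x, y, x_dot, y_dot, theta, theta_dot, l, r
NUM_ACTIONS = 4      # nothing, left, main engine, right

q_network = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_STATE_VARS + NUM_ACTIONS,)),  # 12 inputs: state + one-hot action
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),   # single output: the estimate of Q(s, a)
])

def pick_action(state):
    """Evaluate Q(s, a) for all four one-hot actions and return the argmax."""
    actions = np.eye(NUM_ACTIONS)                                   # the 4 one-hot action vectors
    x = np.hstack([np.tile(state, (NUM_ACTIONS, 1)), actions])      # 4 rows of 12 numbers each
    q_values = q_network.predict(x, verbose=0).flatten()            # Q(s, nothing), Q(s, left), ...
    return int(np.argmax(q_values))                                 # index of the highest-value action
```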
So here's the Bellman equation: Q of S, A equals R of S plus gamma times the max over A prime of Q of S prime, A prime. The right-hand side is what we want Q of S, A to be equal to, so I'm going to call the value on the right-hand side Y. The input to the neural network is a state and an action, so I'm going to call that X. The job of the neural network is to input X, that is, a state-action pair, and try to accurately predict the value on the right.

In supervised learning, we were training a neural network to learn a function F, which depends on a bunch of parameters W and B, the parameters of the various layers of the neural network, and it was the job of the neural network to input X and hopefully output something close to the target value Y. So the question is: how can we come up with a training set of values X and Y for the neural network to learn from?

Here's what we're going to do. We're going to use the lunar lander and just try taking different actions in it. If we don't have a good policy yet, we'll take actions randomly: fire the left thruster, fire the right thruster, fire the main engine, do nothing. By trying out different things in the lunar lander simulator, we'll observe a lot of examples where we were in some state, we took some action, maybe a good action, maybe a terrible action, we got some reward R of S for being in that state, and as a result of our action we got to some new state S prime. As you take different actions in the lunar lander, you see these S, A, R of S, S prime, which we call tuples in Python code, many times.

For example, maybe one time you're in some state S, and just to give it an index, I'm going to call it S1. You happen to take some action A1; this could be nothing, left, main thruster, or right. As a result, you got some reward R of S1 and wound up at some state S prime 1. Maybe a different time you were in some other state S2, you took some other action A2, could be a good action, could be a bad action, could be any of the four actions, you got a reward R of S2, and you wound up at S prime 2, and so on, multiple times. Maybe you've done this 10,000 times or even more, so you would have saved away not just S1, A1, and so on, but all the way up to S 10,000, A 10,000.

It turns out that each of these lists of four elements, each of these tuples, is enough to create a single training example, X1, Y1. In particular, here's how you do it. There are four elements in this first tuple: the first two will be used to compute X1, and the second two will be used to compute Y1. X1 is just going to be S1 and A1 put together: S1 is eight numbers, the state of the lunar lander, and A1 is four numbers, the one-hot encoding of whatever action this was. Y1 is computed using the right-hand side of the Bellman equation. In particular, the Bellman equation says that when you input S1, A1, you want Q of S1, A1 to be equal to the right-hand side, R of S1 plus gamma times the max over A prime of Q of S prime 1, A prime. Notice that the last two elements of the tuple give you enough information to compute this: you know R of S1, that's the reward you saved away, plus the discount factor gamma times the max over all possible actions A prime of Q of S prime 1, A prime, where S prime 1 is the state you got to in this example. And I'm going to call this quantity Y1.
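To make the construction of a training example concrete, here is a small sketch, under the same assumptions as the network sketch above, of turning one experience tuple S, A, R of S, S prime into a single pair X, Y using the right-hand side of the Bellman equation with the current guess of Q. The discount factor value is an illustrative assumption.

```python
# Sketch only: GAMMA and helper names are illustrative assumptions.
GAMMA = 0.995  # discount factor

def tuple_to_example(state, action, reward, next_state):
    """Turn one (s, a, R(s), s') tuple into one training example (x, y)."""
    one_hot_action = np.eye(NUM_ACTIONS)[action]
    x = np.concatenate([state, one_hot_action])          # 12 numbers: S and A put together

    # max over a' of Q(s', a'), computed with the current (possibly rough) guess of Q
    actions = np.eye(NUM_ACTIONS)
    next_inputs = np.hstack([np.tile(next_state, (NUM_ACTIONS, 1)), actions])
    max_q_next = q_network.predict(next_inputs, verbose=0).max()

    y = reward + GAMMA * max_q_next                       # right-hand side of the Bellman equation
    return x, y
```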
When you compute this, it will be some number like 12.5 or 17 or 0.5, and we'll save that number as Y1, so that the pair X1, Y1 becomes the first training example in this little data set we're building.

Now, you may be wondering: where does Q of S prime 1, A prime come from? Well, initially we don't know what the Q function is. But it turns out that when you don't know the Q function, you can start with a totally random guess for it, and we'll see on the next slide that the algorithm will work nonetheless. At every step, Q here is just going to be some guess, and it will get better over time, it turns out, at approximating the actual Q function.

Let's look at a second example. If you had a second experience where you were in state S2, took action A2, got that reward, and then got to state S prime 2, we would create a second training example X2, where the input is S2 and A2 put together, so the first two elements go into computing the input X. Then Y2 will be equal to R of S2 plus gamma times the max over A prime of Q of S prime 2, A prime, and whatever number that is, we put it in our small but growing training set as Y2. And so on and so forth, until maybe you end up with 10,000 training examples with these X, Y pairs.

What we'll see later is that we'll actually take this training set, where the Xs are inputs with 12 features and the Ys are just numbers, and train a neural network with, say, the mean squared error loss to try to predict Y as a function of the input X. So what I've described here is just one piece of the learning algorithm we'll use. Let's put it all together on the next slide and see how it comes together into a single algorithm.

So let's take a look at what the full algorithm for learning the Q function looks like. First, we're going to take our neural network and initialize all of its parameters randomly. Initially we have no idea what the Q function is, so let's just pick totally random values for the weights and pretend that this neural network is our initial random guess for the Q function. This is a little bit like training linear regression, where you initialize all the parameters randomly and then use gradient descent to improve them. Initializing randomly for now is fine; what's important is whether the algorithm can gradually improve the parameters to get a better estimate.

Next, we will repeatedly do the following. We will take actions in the lunar lander: fly it around randomly, take some good actions, take some bad actions, it's okay either way. You get lots of these tuples of being in some state S, taking some action A, getting a reward R of S, and getting to some state S prime. What we will do is store the 10,000 most recent examples of these tuples. As you run this algorithm, you will see many, many steps in the lunar lander, maybe hundreds of thousands of steps, but to make sure we don't end up using excessive computer memory, common practice is to remember only the 10,000 most recent tuples we saw while taking actions in the MDP. This technique of storing only the most recent examples is sometimes called a replay buffer in a reinforcement learning algorithm. So for now, we're just flying the lunar lander randomly, sometimes crashing, sometimes not, and collecting these tuples as experience for our learning algorithm. Occasionally, then, we will train the neural network.
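Here is a minimal sketch of the replay buffer idea, keeping only the 10,000 most recent tuples while acting randomly in the simulator. The environment name and the Gymnasium API calls shown are assumptions about the simulator interface, not part of the lecture; adjust them to whatever environment you have installed.

```python
# Sketch only: the environment ID and API are assumptions (Gymnasium interface).
from collections import deque
import gymnasium as gym

replay_buffer = deque(maxlen=10_000)   # oldest tuples are dropped automatically

env = gym.make("LunarLander-v2")       # may be "LunarLander-v3" in newer Gymnasium versions
state, _ = env.reset()
for _ in range(1_000):
    action = env.action_space.sample()                                # act randomly for now
    next_state, reward, terminated, truncated, _ = env.step(action)
    replay_buffer.append((state, action, reward, next_state))         # save the (s, a, R(s), s') tuple
    state = next_state if not (terminated or truncated) else env.reset()[0]
```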
In order to train the neural network, here's what we'll do. We'll look at the 10,000 most recent tuples we have saved and create a training set of 10,000 examples. The training set needs lots of pairs of X and Y. For each training example, X will be the S and A from the first part of the tuple, so it will be a list of 12 numbers: the 8 numbers for the state and the 4 numbers for the one-hot encoding of the action. The target value that we want the neural network to predict will be Y equals R of S plus gamma times the max over A prime of Q of S prime, A prime. How do we get this value of Q? Well, initially it is the neural network we randomly initialized, so it may not be a very good guess, but it's a guess.

After creating these 10,000 training examples, we'll have training examples X1, Y1 through X 10,000, Y 10,000, and we'll train a neural network, which I'm going to call Q new, such that Q new of S, A learns to approximate Y. This is exactly training the neural network F, with parameters W and B, to take the input X and try to approximate the target value Y. This neural network should be a slightly better estimate of the Q function, or state-action value function. So what we'll do is take Q and set it to this new neural network we just learned. Many of the ideas in this algorithm are due to Mnih et al.

It turns out that if you run this algorithm, where you start with a really random guess of the Q function and then use Bellman's equation to repeatedly improve its estimates, then by doing this over and over, taking lots of actions and training a model, your guess for the Q function improves. For the next model you train, you now have a slightly better estimate of the Q function, and the model after that will be better still. When you update Q equals Q new, then the next time you train a model, Q of S prime, A prime will be an even better estimate. So as you run this algorithm, on every iteration Q of S prime, A prime hopefully becomes a better and better estimate of the Q function, and when you run it long enough, it becomes a pretty good estimate of the true value of Q of S, A, which you can then use to pick good actions in the MDP.

The algorithm you just saw is sometimes called the DQN algorithm, which stands for Deep Q-Network, because you're using deep learning, a neural network, to learn the Q function. If you use the algorithm as I described it, it will kind of work okay on the lunar lander: maybe it will take a long time to converge, maybe it won't land perfectly, but it will sort of work. It turns out that with a couple of refinements, the algorithm can work much better. So in the next few videos, let's take a look at some refinements to the algorithm you just saw.
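Putting the pieces above together, here is a sketch of the periodic training step just described: build X, Y pairs from the replay buffer using the current Q as the guess, fit a new network with mean squared error, and adopt it as the new Q. Function and variable names, the optimizer, and the number of epochs are illustrative assumptions, not the course's notebook code.

```python
# Sketch only: training hyperparameters and names are assumptions.
def train_q_network():
    """One training round: buffer -> (X, Y) pairs -> fit Q_new -> set Q = Q_new."""
    xs, ys = [], []
    for state, action, reward, next_state in replay_buffer:
        x, y = tuple_to_example(state, action, reward, next_state)   # Bellman targets with current Q
        xs.append(x)
        ys.append(y)
    X = np.array(xs)   # shape (num_examples, 12)
    Y = np.array(ys)   # shape (num_examples,)

    # Q_new has the same architecture; train it to predict Y from X with mean squared error.
    q_new = tf.keras.models.clone_model(q_network)
    q_new.compile(optimizer="adam", loss="mse")
    q_new.fit(X, Y, epochs=10, verbose=0)

    # Set Q = Q_new so the next round of action-taking and target computation uses the better estimate.
    q_network.set_weights(q_new.get_weights())
```

In a real implementation you would batch the target computation rather than calling the network once per tuple, but the logic is the same: targets come from the Bellman right-hand side evaluated with the current guess of Q.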