Let me summarize where we are. If you can compute the state-action value function, Q of S, A, then it gives you a way to pick a good action from every state: just pick the action A that gives you the largest value of Q of S, A. So the question is, how do you compute these values Q of S, A? In reinforcement learning, there's a key equation called the Bellman equation that will help us compute the state-action value function. Let's take a look at what this equation is.

As a reminder, this is the definition of Q of S, A: it is the return if you start at state S, take the action A once, and then behave optimally after that. In order to describe the Bellman equation, I'm going to use the following notation. I'm going to use S to denote the current state. Next, I'm going to use R of S to denote the reward of the current state. So for our MDP example, we would have that R of state 1 is 100, the reward of state 2 is 0, and so on, and the reward of state 6 is 40. I'm going to use the letter A to denote the current action, that is, the action that you take in the state S. After you take the action A, you get to some new state. For example, if you're in state 4 and you take the action left, then you get to state 3, and so I'm going to use S prime to denote the state you get to after taking the action A from the current state S. I'm also going to use A prime to denote the action that you might take in state S prime, the new state you got to. The notation convention, by the way, is that S, A corresponds to the current state and action, and when we add the prime, that's the next state and the next action.

The Bellman equation is the following. It says that Q of S, A, that is, the return under this set of assumptions, is equal to R of S, the reward you get for being in that state, plus the discount factor gamma times the max over all possible actions A prime of Q of S prime, the new state you just got to, and A prime. There's a lot going on in this equation, so let's first take a look at some examples, and then we'll come back to see why this equation makes sense.

Let's look at Q of state 2 and the action right, and apply the Bellman equation to see what value it gives us. If the current state is state 2 and the action is to go right, then the next state you get to after going right, S prime, would be state 3. The Bellman equation says Q of 2, right is R of S, that is, R of state 2, which is just the reward 0, plus the discount factor gamma, which we've set to 0.5 in this example, times the max of the Q values in state S prime, that is, in state 3. So this is going to be the max of 25 and 6.25, since this is the max over A prime of Q of S prime, A prime, and we take the larger of 25 and 6.25 because those are the two choices for state 3. This turns out to be equal to 0 plus 0.5 times 25, which is equal to 12.5, which is indeed Q of 2 with the action right.

Let's look at just one more example. Let me pick state 4 and see what Q of state 4 is if you decide to go left. In this case, the current state is 4, the current action is to go left, and so the next state, if you start from 4 and go left, is again state 3. So S prime is 3 again. The Bellman equation says this is equal to R of S, that is, R of state 4, which is 0, plus 0.5, the discount factor gamma, times the max over A prime of Q of S prime, which is state 3 again, A prime.
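If you'd like to see these numbers come out of an actual computation, here is a minimal sketch, not code from the lecture, that repeatedly applies the Bellman equation to the six-state example (rewards 100, 0, 0, 0, 0, 40, discount factor 0.5, deterministic left/right moves, states 1 and 6 terminal). The variable names and the number of sweeps are my own choices for illustration.

```python
# A minimal sketch (not from the lecture): repeatedly apply the Bellman equation
# to the six-state example with rewards [100, 0, 0, 0, 0, 40] and gamma = 0.5.
# States are 0-indexed here, so "state 1" in the lecture is index 0.

R = [100.0, 0.0, 0.0, 0.0, 0.0, 40.0]   # R(s): reward of each state
gamma = 0.5                              # discount factor
terminal = {0, 5}                        # states 1 and 6 are terminal
actions = {"left": -1, "right": +1}      # deterministic moves

# Q[s][a] holds Q(s, a); start at zero and sweep until the values settle.
Q = [{a: 0.0 for a in actions} for _ in range(len(R))]

for _ in range(20):                      # plenty of sweeps for this tiny MDP
    for s in range(len(R)):
        for a, step in actions.items():
            if s in terminal:
                Q[s][a] = R[s]           # terminal state: Q(s, a) = R(s)
            else:
                s_next = s + step        # state you land in after action a
                # Bellman equation: Q(s, a) = R(s) + gamma * max_a' Q(s', a')
                Q[s][a] = R[s] + gamma * max(Q[s_next].values())

print(Q[1]["right"])   # state 2, action right -> 12.5
print(Q[3]["left"])    # state 4, action left  -> 12.5
```

Running this reproduces the values worked out above, including Q of 2, right equal to 12.5 and Q of 4, left equal to 12.5.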
So once again, the Q values for state 3 are 25 and 6.25, and the larger of these is 25, and so this works out to be R of 4, which is 0, plus 0.5 times 25, which is again equal to 12.5. So that's why Q of 4 with the action left is also equal to 12.5. Just one note: if you're in a terminal state, then the Bellman equation simplifies to Q of S, A equals R of S, because there's no next state S prime, and so that second term goes away. That's why Q of S, A in the terminal states is just 100 or 40, for either action. If you wish, feel free to pause the video, apply the Bellman equation to any other state-action pair in this MDP, and check for yourself that the math works out.

Just to recap, this is how we had defined Q of S, A, and we saw earlier that the best possible return from any state S is the max over A of Q of S, A. In fact, just renaming S and A, the best possible return from a state S prime is the max over A prime of Q of S prime, A prime. I didn't really do anything other than rename S to S prime and A to A prime, but this will make some of the intuitions a little bit easier later. So for any state S prime, like state 3, the best possible return from state 3 is the max over all possible actions of Q of S prime, A prime.

So here again is the Bellman equation, and the intuition it captures is this: if you start from state S, take action A, and then act optimally after that, you're going to see some sequence of rewards over time. In particular, the return will be the reward at the first step, plus gamma times the reward at the second step, plus gamma squared times the reward at the third step, and so on, until you get to a terminal state. What the Bellman equation says is that this sequence of discounted rewards can be broken down into two components. The first is R of S, the reward you get right away. In the reinforcement learning literature, this is sometimes also called the immediate reward; that's what R1 is, the reward you get for starting out in the state S. The second term is the following: after you start in state S and take action A, you get to some new state S prime. The definition of Q of S, A assumes we behave optimally after that, so after we get to S prime, we're going to behave optimally and get the best possible return from the state S prime. And so this term, the max over A prime of Q of S prime, A prime, is the return from behaving optimally starting from the state S prime. That's exactly what we had written up here: the best possible return when you start from state S prime.

Another way of phrasing this is that this total return down here is also equal to R1 plus, and I'm going to factor out gamma in the math, gamma times the quantity R2 plus gamma times R3 plus gamma squared times R4, and so on. Notice that if you were starting from state S prime, the sequence of rewards you would get is R2, then R3, then R4, and so on, and that's why this expression here is the total return if you were to start from state S prime. And if you were to behave optimally, then this expression would be the best possible return starting from state S prime, which is why this sequence of discounted rewards equals the max over A prime of Q of S prime, A prime. We're also left with that extra discount factor gamma out front, which is why Q of S, A is equal to the expression over here.
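Written out in one line, the decomposition just described looks like this, a sketch in standard notation where R1 = R(s) is the immediate reward and R2, R3, and so on are the rewards collected from the next state s' onward:

```latex
% Bellman decomposition of the return (gamma is the discount factor):
Q(s,a) = R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_4 + \cdots
       = R_1 + \gamma \bigl( R_2 + \gamma R_3 + \gamma^2 R_4 + \cdots \bigr)
       = R(s) + \gamma \max_{a'} Q(s', a')
```

The last step holds because Q of S, A assumes we act optimally after the first action, so the sum in parentheses is exactly the best possible return from the next state S prime.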
In case you think this is quite complicated and you aren't following all the details, don't worry about it. So long as you apply this equation, you will manage to get the right results. But the high-level intuition I hope you take away is that the total return you get in a reinforcement learning problem has two parts. The first part is the reward that you get right away, and the second part is gamma times the return you get starting from the next state S prime. These two components together, R of S plus gamma times the return from the next state, equal the total return from the current state S. That is the essence of the Bellman equation.

Just to relate this back to our earlier example: Q of 4, left is the total return for starting in state 4 and going left. If you go left from state 4, the rewards you get are 0 in state 4, 0 in state 3, 0 in state 2, and then 100, which is why the total return is 0 plus 0.5 times 0 plus 0.5 squared times 0 plus 0.5 cubed times 100, which is 12.5. What the Bellman equation says is that we can break this up into two pieces. There is this 0, which is R of state 4, plus 0.5 times this other sequence, 0 plus 0.5 times 0 plus 0.5 squared times 100. But if you look at what this sequence is, it is really the optimal return from the next state S prime that you got to after taking the action left from state 4. So that's why this is equal to the reward of state 4 plus 0.5 times the optimal return from state 3, because if you were to start from state 3, the rewards you would get are 0, followed by 0, followed by 100. So this is the optimal return from state 3, and that's why this is just R of 4 plus 0.5 times the max over A prime of Q of state 3, A prime.

I know the Bellman equation is a somewhat complicated equation, breaking down your total return into the reward you get right away, the immediate reward, plus gamma times the return from the next state S prime. If it kind of makes sense to you but not fully, that's okay; don't worry about it. You can still apply the Bellman equation to get a reinforcement learning algorithm to work correctly. But I hope that at least the high-level intuition of breaking down the return into what you get right away plus what you get in the future makes sense.

Before moving on to develop a reinforcement learning algorithm, we have coming up next an optional video on stochastic Markov decision processes, that is, reinforcement learning applications where the actions you take can have a slightly random effect. Take a look at the optional video if you wish, and after that, we'll start to develop a reinforcement learning algorithm.
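As a quick sanity check of that last decomposition, here is a tiny sketch (using reward sequences assumed from the example above, not lecture code) verifying that the return for going left from state 4 equals R of 4 plus gamma times the optimal return from state 3:

```python
# Verify the decomposition for Q(4, left): the total return from state 4 equals
# R(4) + gamma * (optimal return from state 3). Values assumed from the example.

gamma = 0.5
rewards_from_4 = [0, 0, 0, 100]   # rewards seen along 4 -> 3 -> 2 -> 1
rewards_from_3 = [0, 0, 100]      # rewards seen along 3 -> 2 -> 1

return_from_4 = sum(gamma**k * r for k, r in enumerate(rewards_from_4))
return_from_3 = sum(gamma**k * r for k, r in enumerate(rewards_from_3))

print(return_from_4)              # 12.5
print(0 + gamma * return_from_3)  # R(4) + 0.5 * 25 = 12.5
```

Both expressions come out to 12.5, matching the value of Q of 4, left computed earlier.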