When we start to develop reinforcement learning algorithms later this week, you'll see that there's a key quantity those algorithms will try to compute, called the state-action value function. Let's take a look at what this function is.

The state-action value function is a function typically denoted by the uppercase letter Q, and it's a function of a state you might be in as well as the action you might choose to take in that state. Q of S, A gives a number that equals the return if you start in that state S, take the action A just once, and then behave optimally after that, meaning that after that first action you take whatever actions result in the highest possible return.

Now, you might be thinking there's something a little bit strange about this definition, because how do we know what the optimal behavior is? And if we already knew the best action to take in every state, why would we still need to compute Q of S, A, since we'd already have the optimal policy? So I do want to acknowledge that there's something a little bit circular about this definition. But rest assured, when we look at specific reinforcement learning algorithms later, we'll resolve this slightly circular definition and come up with a way to compute the Q function even before we've come up with the optimal policy. You'll see that in a later video, so don't worry about this for now.

Let's look at an example. We saw previously that this is a pretty good policy: go left from states 2, 3, and 4, and go right from state 5. It turns out that this is actually the optimal policy for the Mars Rover application when the discount factor gamma is 0.5. So Q of S, A will be equal to the total return if you start from state S, take the action A, and then behave optimally after that, meaning you take actions according to this policy shown over here.

Let's figure out what Q of S, A is for a few different states. Take Q of state 2 with the action go right. If you're in state 2 and you go right, you end up at state 3, and after that you behave optimally: you go left from state 3, then left from state 2, and eventually get the reward of 100. In this case, the rewards you get are 0 in state 2, 0 when you get to state 3, 0 when you get back to state 2, and then 100 when you finally reach the terminal state 1. So the return is 0 plus 0.5 times 0 plus 0.5 squared times 0 plus 0.5 cubed times 100, which turns out to be 12.5. So Q of state 2, going right, equals 12.5. Note that this passes no judgment on whether going right is a good idea or not. Going right from state 2 is actually not that good an idea, but Q just faithfully reports the return you get if you take action A and then behave optimally afterward.

Here's another example. If you're in state 2 and you go left, the sequence of rewards you get is 0 in state 2 followed by 100, so the return is 0 plus 0.5 times 100, which equals 50. To write down the values of Q of S, A in this diagram, I'm going to write 12.5 here on the right to denote that this is Q of state 2 going right, and a little 50 here on the left to denote that this is Q of state 2 going left.
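To make the definition concrete, here is a minimal Python sketch of Q of S, A for this example. It assumes the six-state Mars Rover layout from the lecture (terminal rewards 100 and 40, zero elsewhere, gamma equal to 0.5, deterministic moves) and the optimal policy of going left from states 2 through 4 and right from state 5. The names REWARDS, OPTIMAL_POLICY, step, and Q are illustrative choices, not code from the course.

```python
# A minimal sketch of Q(s, a) for the six-state Mars Rover example.
# Assumptions (illustrative, matching the lecture setup): terminal state 1
# has reward 100, terminal state 6 has reward 40, all other states have
# reward 0, the discount factor gamma is 0.5, transitions are deterministic,
# and the optimal policy is to go left from states 2-4 and right from state 5.

GAMMA = 0.5
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}
OPTIMAL_POLICY = {2: "left", 3: "left", 4: "left", 5: "right"}

def step(state, action):
    """Deterministic transition: 'left' decreases the state index, 'right' increases it."""
    return state - 1 if action == "left" else state + 1

def Q(state, action):
    """Return obtained by starting in `state`, taking `action` once, then acting optimally."""
    total, discount = REWARDS[state], GAMMA
    if state in TERMINAL:
        return total                     # at a terminal state the action doesn't matter
    state = step(state, action)          # take the chosen action exactly once
    while state not in TERMINAL:
        total += discount * REWARDS[state]
        discount *= GAMMA
        state = step(state, OPTIMAL_POLICY[state])   # then follow the optimal policy
    return total + discount * REWARDS[state]

print(Q(2, "right"))  # 12.5, as computed in the lecture
print(Q(2, "left"))   # 50.0
```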
Just to take one more example, what if we're in state 4 and we decide to go left? Well, if you're in state 4 and you go left, you get reward 0 for that first action, and then, because you follow the optimal policy afterward, you go left again for reward 0, left again for reward 0, and then reach the terminal state for reward 100. So Q of 4, left gives the sequence of rewards 0, 0, 0, 100, and the return is 0 plus 0.5 times 0 plus 0.5 squared times 0 plus 0.5 cubed times 100, which is again 12.5. So Q of 4, left is 12.5, and I'll write that here as 12.5.

It turns out that if you carry out this exercise for all of the other states and all of the other actions, you end up with these values of Q of S, A for the different states and actions. Finally, at the terminal states it doesn't matter what you do; you just get the terminal reward, 100 or 40, so I'll write those terminal rewards over here. So this is Q of S, A for every state, states 1 through 6, and for the two actions, left and right. Because the state-action value function is almost always denoted by the letter Q, it's also often called the Q function. The terms Q function and state-action value function are used interchangeably, and it tells you what your return is, or really what the value is, of taking action A in state S and then behaving optimally after that.

Now it turns out that once you can compute the Q function, this gives you a way to pick actions as well. Here's the policy and return, and here are the values of Q of S, A from the previous slide. You'll notice one interesting thing when you look at the different states: in state 2, taking the action left results in a Q value, or state-action value, of 50, which is actually the best possible return you can get from that state. In state 3, the action left also gives you the higher return. In state 4, the action left gives you the higher return, and in state 5 it's actually the action right that gives you the higher return of 20. So it turns out that the best possible return from any state S is the largest value of Q of S, A, maximizing over A.

Just to make sure this is clear: in state 4, there is Q of state 4, left, which is 12.5, and Q of state 4, right, which turns out to be 10, and the larger of these two values, 12.5, is the best possible return from state 4. In other words, the highest return you can hope to get from state 4 is 12.5, the larger of these two numbers. Moreover, if you want your Mars Rover to enjoy a return of 12.5 rather than 10, then the action you should take is the action A that gives you the larger value of Q of S, A. So the best possible action in state S is the action A that maximizes Q of S, A.

This might give you a hint for why computing Q of S, A is an important part of the reinforcement learning algorithms we'll build later. Namely, if you have a way of computing Q of S, A for every state and every action, then when you're in some state S, all you have to do is look at the different actions A and pick the one that maximizes Q of S, A. So pi of S can just pick the action A that gives the largest value of Q of S, A, and that will turn out to be a good action; in fact, it'll turn out to be the optimal action. Another intuition for why this makes sense: Q of S, A is the return if you start in state S, take the action A, and then behave optimally after that.
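Continuing the same hypothetical sketch, you could tabulate Q of S, A for every state and action and then read off, for each state, the largest Q value (the best possible return) and the action that achieves it, which recovers the optimal policy:

```python
# Continuing the sketch above: build the full Q table, then read off, for each
# state, max_a Q(s, a) (the best possible return) and argmax_a Q(s, a) (the
# best action). At state 2 this prints 50 and 12.5 with "left" as the best
# action, and at state 5 it prints 20 with "right", matching the lecture.
ACTIONS = ("left", "right")

for s in range(1, 7):
    q_values = {a: Q(s, a) for a in ACTIONS}
    best_action = max(q_values, key=q_values.get)    # argmax over actions
    best_return = q_values[best_action]              # max over actions
    print(f"state {s}: Q(left)={q_values['left']:6.2f}  "
          f"Q(right)={q_values['right']:6.2f}  "
          f"best action={best_action}  best return={best_return:6.2f}")
```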
So in order to earn the biggest possible return, what you really want is to take the action A that results in the biggest total return. That's why, if only we had a way of computing Q of S, A for every state, taking the action A that maximizes it seems like the best action to take in that state.

Although this isn't something you need to know for this course, I also want to mention that if you look online or at the reinforcement learning literature, you'll sometimes see this Q function written as Q star instead of Q, and it's sometimes also called the optimal Q function. These terms refer to the Q function exactly as we've defined it. So if you read about Q star or the optimal Q function, that just means the state-action value function we've been talking about. But for the purposes of this course, you don't need to worry about this.

To summarize, if you can compute Q of S, A for every state and every action, then that gives us a good way to compute the optimal policy pi of S. So that's the state-action value function, or the Q function. We'll talk later about how to come up with an algorithm to compute it, despite the slightly circular aspect of its definition. But first, let's take a look in the next video at some specific examples of what these values Q of S, A actually look like.
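For reference, here is the relationship above written in the Q star notation mentioned in this video; this is just the standard notation from the literature for what was already said, not additional course material:

```latex
% Q^*(s,a): return from starting in state s, taking action a once, then acting optimally.
% The best possible return from s, and the best action to take in s, are:
\[
\max_a Q^*(s, a) = \text{best possible return from state } s,
\qquad
\pi^*(s) = \arg\max_a Q^*(s, a).
\]
```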