You saw in the last video what the states of a reinforcement learning application are, as well as how, depending on the actions you take, you go through different states and also get to enjoy different rewards. But how do you know if a particular set of rewards is better or worse than a different set of rewards? The return in reinforcement learning, which we'll define in this video, allows us to capture that.

As we go through this, one analogy that you might find helpful is to imagine you have a $5 bill at your feet that you can reach down and pick up, or, half an hour's walk across town, a $10 bill you could go and pick up. Which one would you rather go after? $10 is much better than $5, but if you need to walk for half an hour to get that $10 bill, then maybe it'd be more convenient to just pick up the $5 bill instead. The concept of a return captures the idea that rewards you can get quickly are maybe more attractive than rewards that take a long time to reach. Let's take a look at exactly how that works.

Here's the Mars rover example. If, starting from state 4, you always go to the left, we saw that the rewards you get would be zero on the first step from state 4, zero from state 3, zero from state 2, and then 100 at state 1, the terminal state. The return is defined as the sum of these rewards, but weighted by one additional factor called the discount factor. The discount factor is a number a little bit less than one, so let me pick 0.9 as the discount factor. The return is the reward on the first step, which is zero, plus the discount factor, 0.9, times the reward on the second step, plus the discount factor squared times the reward on the third step, plus the discount factor cubed times the reward on the fourth step. If you calculate this out, it turns out to be 0.9 cubed times 100, which is 0.729 times 100, or 72.9.

The more general formula for the return is that if your robot goes through some sequence of states and gets reward R1 on the first step, R2 on the second step, R3 on the third step, and so on, then the return is R1 plus the discount factor gamma (that's the Greek letter gamma, which I've set to 0.9 in this example) times R2, plus gamma squared times R3, plus gamma cubed times R4, and so on, until you get to the terminal state. What the discount factor gamma does is make the reinforcement learning algorithm a little bit impatient: the return gives full credit to the first reward, that's 1 times R1, but a little bit less credit to the reward you get at the second step, which is multiplied by 0.9, and even less credit to the reward you get at the next time step, R3, and so on. So getting rewards sooner results in a higher value for the total return.
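If it helps to see this numerically, here is a minimal Python sketch of that return computation. It assumes the reward sequence from the "start in state 4, always go left" example; the function name `discounted_return` is just an illustrative choice, not code from the course.

```python
# Minimal sketch: the discounted return G = R1 + gamma*R2 + gamma^2*R3 + ...
# Reward sequence below matches "start in state 4, always go left" from the lecture.

def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its (zero-based) time step."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards_from_state_4 = [0, 0, 0, 100]  # rewards collected on the way to terminal state 1

print(discounted_return(rewards_from_state_4, gamma=0.9))  # ≈ 72.9, i.e. 0.9**3 * 100
```

The first reward is weighted by gamma to the power zero, so it gets full credit, and each later reward is weighted a little less.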
In many reinforcement learning algorithms, a common choice for the discount factor is a number pretty close to 1, like 0.9 or 0.99 or even 0.999. But for illustrative purposes, in the running example I'm going to use, I'm actually going to use a discount factor of 0.5. This very heavily downweights, or, as we say, very heavily discounts, rewards in the future, because with every additional passing time step you get only half as much credit as for rewards you would have gotten one step earlier. If gamma were equal to 0.5, the return in the example above would have been 0, plus 0.5 times 0, plus 0.5 squared times 0, plus 0.5 cubed times 100 (that's the last reward, because state 1 is a terminal state), and this turns out to be a return of 12.5.

In financial applications, the discount factor also has a very natural interpretation as the interest rate, or the time value of money. A dollar today may be worth a little bit more than a dollar you can only get in the future, because if you get a dollar today you can put it in the bank, earn some interest, and end up with a little bit more money a year from now. So for financial applications, the discount factor often represents how much less a dollar in the future is worth compared to a dollar today.

Let's look at some concrete examples of returns. The return you get depends on the rewards, and the rewards depend on the actions you take, so the return depends on the actions you take. Let's use our usual example and say that, for this example, I'm going to always go to the left. We already saw that if the robot were to start off in state 4, the return is 12.5, as we worked out above. It turns out that if it were to start off in state 3, the return would be 25, because it gets to the 100 reward one step sooner, and so it's discounted less. If it were to start off in state 2, the return would be 50, and if it were to start off in state 1, it gets the reward of 100 right away, so the reward isn't discounted at all and the return is 100. The return if it were to start off in state 5 would be 6.25. And if it were to start off in state 6, which is a terminal state, it just gets the reward of 40, and thus the return is 40.

Now, if you were to take a different set of actions, the returns would actually be different. For example, if we were to always go to the right, then starting in state 4 you get a reward of 0, then you get to state 5 and get a reward of 0, and then you get to state 6 and get a reward of 40. In this case, the return would be 0, plus the discount factor 0.5 times 0, plus 0.5 squared times 40; 0.5 squared is one quarter, and one quarter of 40 is 10. So the return from state 4, if you take actions that always go to the right, is 10. Through similar reasoning, the return from state 5 is 20, the return from state 3 is 5, the return from state 2 is 2.5, and the returns at the two terminal states are still 100 and 40. By the way, if these numbers don't fully make sense, feel free to pause the video, double-check the math, and see if you can convince yourself that these are the right values for the return if you start from different states and always go to the right. So we see that if we were to always go to the right, the return you expect to get is lower for most states.
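To check these per-state numbers, here is a hedged sketch of the Mars rover example as described: six states in a row, terminal rewards of 100 at state 1 and 40 at state 6, zero reward elsewhere, and a discount factor of 0.5. The names `REWARDS`, `TERMINAL`, and `return_from` are illustrative, not taken from the course notebooks.

```python
# Illustrative sketch of the six-state Mars rover example with gamma = 0.5.
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}
GAMMA = 0.5

def return_from(state, action):
    """Discounted return when starting in `state` and always moving `action` ('left' or 'right')."""
    g, discount = 0.0, 1.0
    while True:
        g += discount * REWARDS[state]           # collect this state's reward
        if state in TERMINAL:
            return g
        state += -1 if action == "left" else 1   # move one state left or right
        discount *= GAMMA                        # each later reward counts for half as much

for s in range(1, 7):
    print(s, return_from(s, "left"), return_from(s, "right"))
# e.g. state 4: 12.5 going left vs 10 going right; state 5: 6.25 going left vs 20 going right
```

Running the loop reproduces the returns from the lecture: 100, 50, 25, 12.5, 6.25, 40 when always going left, and 100, 2.5, 5, 10, 20, 40 when always going right.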
So maybe always going to the right isn't as good an idea as always going to the left. But it turns out that we don't have to always go to the left or always go to the right. We could also decide: if you're in state 2, go left; if you're in state 3, go left; if you're in state 4, go left; but if you're in state 5, you're so close to this reward, let's go right. This would be a different way of choosing which action to take based on what state you're in. It turns out that the returns you get from the different states will then be 100, 50, 25, 12.5, 20, and 40. Just to illustrate one case, if you were to start off in state 5, here you would go to the right, and so the rewards you get would be 0 first in state 5 and then 40. The return is 0, the first reward, plus the discount factor 0.5 times 40, which is 20, which is why the return from this state is 20 if you take the actions shown here.

To summarize, the return in reinforcement learning is the sum of the rewards that the system gets, weighted by the discount factor, where rewards in the far future are weighted by the discount factor raised to a higher power. This actually has an interesting effect when you have systems with negative rewards. In the example we went through, all the rewards were 0 or positive. But if there are any rewards that are negative, then the discount factor actually incentivizes the system to push out the negative rewards as far into the future as possible. Taking a financial example, if you had to pay someone $10, that's a negative reward of minus 10. But if you could postpone the payment by a few years, then you're actually better off, because $10 a few years from now, because of the interest rate, is actually worth less than $10 that you have to pay today. So for systems with negative rewards, the discount factor causes the algorithm to try to push out the negative rewards as far into the future as possible, and for financial applications and some other applications, that actually turns out to be the right thing for the system to do.

You now know what the return in reinforcement learning is. Let's go on to the next video to formalize the goal of a reinforcement learning algorithm.
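To make the negative-reward point concrete, here is a small illustrative sketch (not from the course) comparing an immediate payment of $10 with the same payment postponed by three time steps, using the discount factor of 0.5 from the running example.

```python
# Illustrative sketch: with a discount factor below 1, a negative reward hurts the
# return less the further into the future it occurs.
GAMMA = 0.5

def discounted_return(rewards, gamma=GAMMA):
    """Sum of rewards, each weighted by gamma raised to its (zero-based) time step."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

pay_now   = [-10]            # pay the $10 immediately
pay_later = [0, 0, 0, -10]   # postpone the same payment by three time steps

print(discounted_return(pay_now))    # -10.0
print(discounted_return(pay_later))  # -1.25, a much smaller penalty on the return
```

Because the postponed payment is weighted by 0.5 cubed, the return is higher when the negative reward is pushed further into the future, which is exactly the incentive described above.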