In some applications, when you take an action, the outcome is not always completely reliable. For example, if you command your Mars rover to go left, maybe there's a little bit of a rock slide, or maybe the floor is really slippery, and so it slips and goes in the wrong direction. In practice, many robots don't always manage to do exactly what you tell them, because of wind blowing them off course, or the wheels slipping, or something else. So there's a generalization of the reinforcement learning framework we've talked about so far, which models random or stochastic environments. In this optional video, we'll talk about how these reinforcement learning problems work.

Continuing with our simplified Mars rover example, let's say you take the action and command it to go left. Most of the time it will succeed, but what if 10% of the time it accidentally slips and goes in the opposite direction? So if you command it to go left, it has a 90% chance, that is, a probability of 0.9, of correctly going in the left direction, and a 10% chance, a probability of 0.1, of heading to the right instead. So it has a 90% chance of ending up in state 3 in this example, and a 10% chance of ending up in state 5. Conversely, if you were to command it to go right, and take the action right, it has a 0.9 probability of ending up in state 5, and a 0.1 probability of ending up in state 3. This would be an example of a stochastic environment.

Let's see what happens in this reinforcement learning problem. Let's say you use the policy shown here, where you go left in states 2, 3, and 4, and go right, or try to go right, in state 5. If you were to start in state 4 and follow this policy, then the actual sequence of states you visit may be random. For example, in state 4 you will go left, and maybe you're a little bit lucky and it actually gets to state 3. Then you try to go left again and it reaches state 2, you try to go left once more, and it reaches state 1. If this is what happens, you end up with the sequence of rewards 0, 0, 0, 100.

But if you were to try this exact same policy a second time, maybe you're a little bit less lucky. The second time you start in state 4, try to go left, and say it succeeds, so you get 0 from state 4 and 0 from state 3. Then you try to go left again, but you get unlucky this time, and the robot slips and ends up heading back to state 4 instead. Then you try to go left, then left, then left, and eventually it gets to that reward of 100. In that case, this will be the sequence of rewards you observe, because it went from state 4 to 3, back to 4, then 3, 2, and 1. Or it's even possible that, following the policy from state 4, you get unlucky on the very first step and end up in state 5, because it slipped. Then in state 5 you command it to go right, and it succeeds, so you end up in state 6. In this case, the sequence of rewards you see will be 0, 0, 40, because it went from state 4 to 5, and then to state 6.

We had previously written out the return as this sum of discounted rewards. But when the reinforcement learning problem is stochastic, there isn't one sequence of rewards that you see for sure. Instead, you see a different sequence of rewards depending on how these random slips play out. So in a stochastic reinforcement learning problem, what we're interested in is not maximizing the return, because that's a random number. What we're interested in is maximizing the average value of the sum of discounted rewards.
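To make this idea of averaging concrete, here is a minimal sketch, not code from the optional lab, that simulates the six-state rover with a slip probability and estimates the expected return from state 4 by averaging many rollouts. The rewards, the policy, and the discount factor gamma = 0.5 follow the example from the earlier videos; the function names and the misstep_prob parameter name are just for illustration.

```python
import random

# Six-state Mars rover: states 1 and 6 are terminal, with rewards 100 and 40;
# all other states give reward 0. misstep_prob is the chance the rover moves
# opposite to the commanded direction.
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}

def step(state, action, misstep_prob):
    """Take 'left' or 'right' from state; slip the other way with probability misstep_prob."""
    if random.random() < misstep_prob:
        action = 'right' if action == 'left' else 'left'
    return state - 1 if action == 'left' else state + 1

def rollout(start_state, policy, gamma, misstep_prob):
    """Run one episode and return the discounted sum of rewards R1 + gamma*R2 + gamma^2*R3 + ..."""
    state, discount, total = start_state, 1.0, float(REWARDS[start_state])
    while state not in TERMINAL:
        state = step(state, policy[state], misstep_prob)
        discount *= gamma
        total += discount * REWARDS[state]
    return total

# Policy from the video: go left in states 2, 3, 4 and (try to) go right in state 5.
policy = {2: 'left', 3: 'left', 4: 'left', 5: 'right'}

# Average the return over many episodes to estimate the expected return from state 4.
episodes = [rollout(4, policy, gamma=0.5, misstep_prob=0.1) for _ in range(100_000)]
print(sum(episodes) / len(episodes))
```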
And by average value, I mean that if you were to take your policy and try it out a thousand times, or a hundred thousand times, or a million times, you'd get lots of different reward sequences like these. If you were to take the average, over all of these different sequences, of the sum of discounted rewards, then that's what we call the expected return. In statistics, the term expected is just another way of saying average. What this means is that we want to maximize what we expect to get, on average, in terms of the sum of discounted rewards. The mathematical notation for this is to write it with an E, which stands for the expected value of R1 plus gamma R2 plus gamma squared R3, and so on. So the job of a reinforcement learning algorithm is to choose a policy pi to maximize the average, or expected, sum of discounted rewards. To summarize, when you have a stochastic reinforcement learning problem, or a stochastic Markov decision process, the goal is to choose a policy that tells us what action A to take in state S so as to maximize the expected return.

The last way this changes what we've talked about is that it modifies the Bellman equation a little bit. Here's the Bellman equation exactly as we wrote it down before. The difference now is that when you take the action A in state S, the next state S prime you get to is random. When you're in state 3 and you try to go left, the next state S prime could be state 2 or it could be state 4. So S prime is now random, which is why we also put an average operator, or expected value operator, here. We say that the total return from state S, taking action A once and then behaving optimally, is equal to the reward you get right away, also called the immediate reward, plus the discount factor gamma times what you expect to get on average from the future returns: Q of S, A equals R of S plus gamma times the expected value of the max over A prime of Q of S prime, A prime.

If you want to sharpen your intuition about what happens with these stochastic reinforcement learning problems, you can go back to the optional lab I showed you just now, where this parameter, the misstep probability, is the probability of your Mars rover going in the opposite direction from the one you commanded. If we set the misstep probability to 0.1 and execute the notebook, then these numbers up here are the optimal return if you were to take the best possible actions, the optimal policy, but the robot were to step in the wrong direction 10% of the time, and these are the Q values for the stochastic MDP. Notice that these values are now a little bit lower, because you can't control the robot as well as before: the Q values as well as the optimal returns have gone down a bit. In fact, if you were to increase the misstep probability to, say, 0.4, then 40% of the time the robot doesn't even go in the direction you commanded, and only 60% of the time it goes where you told it to, and these values end up even lower, because your degree of control over the robot has decreased. So I encourage you to play with the optional lab, change the value of the misstep probability, and see how that affects the optimal return, or the optimal expected return, as well as the Q values, Q of S, A.

Now, in everything we've done so far, we've been using this Markov decision process, this Mars rover, with just six states. For many practical applications, the number of states will be much larger.
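If you want to see where Q values like the ones in the notebook come from, here is a minimal sketch, not the lab's implementation, that repeatedly applies the stochastic Bellman equation for the six-state rover. It assumes the rewards and gamma = 0.5 from the earlier videos; compute_q and its argument names are hypothetical.

```python
import numpy as np

# Iterate the stochastic Bellman equation
#   Q(s, a) = R(s) + gamma * E[ max over a' of Q(s', a') ]
# where the expectation averages over the rover slipping with probability misstep_prob.
def compute_q(misstep_prob, gamma=0.5, rewards=(100, 0, 0, 0, 0, 40), n_iters=1000):
    n = len(rewards)
    q = np.zeros((n, 2))              # columns: action 0 = left, action 1 = right
    for _ in range(n_iters):
        v = q.max(axis=1)             # optimal value of each state, max over the two actions
        new_q = np.zeros_like(q)
        for s in range(1, n - 1):     # interior states; the first and last states are terminal
            left, right = v[s - 1], v[s + 1]
            # Commanded left: reach s-1 with prob 1 - misstep_prob, slip to s+1 otherwise.
            new_q[s, 0] = rewards[s] + gamma * ((1 - misstep_prob) * left + misstep_prob * right)
            # Commanded right: the probabilities are swapped.
            new_q[s, 1] = rewards[s] + gamma * ((1 - misstep_prob) * right + misstep_prob * left)
        # Terminal states just collect their reward.
        new_q[0, :] = rewards[0]
        new_q[-1, :] = rewards[-1]
        q = new_q
    return q

print(compute_q(misstep_prob=0.1))    # try misstep_prob=0.4 and watch the values drop further
```

With misstep_prob set to 0, this reproduces the deterministic Q values from the earlier videos (for example, 50 for going left in state 2 and 12.5 for going left in state 4), and as the misstep probability grows, every Q value shrinks, matching the behavior described above.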
In the next video, we'll take the reinforcement learning, or Markov decision process, framework we've talked about so far and generalize it to a much richer, and maybe even more interesting, set of problems with much larger, and even continuous, state spaces. Let's go take a look at that.