The process of building a decision tree, given a training set, has a few steps. In this video, let's take a look at the overall process of what you need to do to build a decision tree.

Given a training set of 10 examples of cats and dogs like you saw in the last video, the first step of decision tree learning is to decide what feature to use at the root node, that is, the first node at the very top of the decision tree. Via an algorithm that we'll talk about in the next few videos, let's say that we decide to pick the ear shape feature for the root node. What that means is we will look at all 10 training examples and split them according to the value of the ear shape feature. In particular, we'll take the five examples with pointy ears and move them down to the left, and the five examples with floppy ears and move them down to the right.

The second step is focusing just on the left part, sometimes called the left branch of the decision tree, to decide what node to put there: in particular, what feature do we want to split on, or what feature do we want to use next? Via an algorithm that, again, we'll talk about later this week, let's say you decide to use the face shape feature there. We'll now take these five examples and split them into two subsets based on their value of face shape: the four examples with a round face shape move down to the left, and the one example with a not-round face shape moves down to the right.

Finally, we notice that these four examples are all cats. Rather than splitting further, we create a leaf node that predicts that anything reaching this node is a cat. Over here, we notice that zero out of the one example is a cat, or alternatively, 100% of the examples here are dogs, so we can create a leaf node that makes a prediction of not cat.

Having done this on the left branch of the decision tree, we now repeat a similar process on the right branch and focus attention on just these five examples, which contain one cat and four dogs. We would have to pick some feature over here to split these five examples further. If we end up choosing the whiskers feature, we then split these five examples based on whether whiskers are present or absent, like so. You notice that one out of the one example on the left is a cat, and zero of the four on the right are cats. Each of these nodes is completely pure, meaning it's all cats or all not cats, with no remaining mix of cats and dogs, so we can create leaf nodes that predict cat on the left and not cat on the right.

This is the process of building a decision tree. Through this process, there were a couple of key decisions that we had to make at various steps of the algorithm. Let's talk through what those key decisions were, and we'll keep fleshing out the details of how to make them in the next few videos. The first key decision was: how do you choose what feature to split on at each node?
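To make that overall process more concrete, here is a minimal sketch, in Python, of the recursive build procedure just described. This is not code from the course: the data layout (one dictionary per example with 0/1 feature values and a "label" key) and the simple majority-class purity score used to pick features are assumptions for illustration; the lecture's actual splitting criterion, entropy, is introduced in the next videos.

```python
# A minimal sketch (not the course's code) of recursive decision tree building
# for binary features. Each example is assumed to be a dict of 0/1 feature
# values plus a "label" key, e.g.
# {"ear_shape": 1, "face_shape": 1, "whiskers": 1, "label": "cat"}.

def majority_label(examples):
    """Most common label at a node; used for leaf predictions."""
    labels = [ex["label"] for ex in examples]
    return max(set(labels), key=labels.count)

def purity(examples):
    """Fraction of examples belonging to the majority class (1.0 means pure)."""
    labels = [ex["label"] for ex in examples]
    return labels.count(majority_label(examples)) / len(labels)

def choose_feature(examples, features):
    """Pick the feature whose split gives the purest branches on average.
    (A stand-in for the entropy-based criterion from the next videos.)"""
    def score(feature):
        left = [ex for ex in examples if ex[feature] == 1]
        right = [ex for ex in examples if ex[feature] == 0]
        if not left or not right:
            return 0.0  # a split that sends everything one way is useless
        n = len(examples)
        return (len(left) * purity(left) + len(right) * purity(right)) / n
    return max(features, key=score)

def build_tree(examples, features, depth=0, max_depth=2):
    # Stop splitting when the node is pure, no features remain,
    # or the maximum depth is reached; then create a leaf node.
    if purity(examples) == 1.0 or not features or depth == max_depth:
        return {"leaf": True, "prediction": majority_label(examples)}

    best = choose_feature(examples, features)
    left = [ex for ex in examples if ex[best] == 1]    # e.g. pointy ears
    right = [ex for ex in examples if ex[best] == 0]   # e.g. floppy ears
    if not left or not right:
        # No feature actually separates these examples; make a leaf instead.
        return {"leaf": True, "prediction": majority_label(examples)}

    remaining = [f for f in features if f != best]
    return {
        "leaf": False,
        "feature": best,
        "left": build_tree(left, remaining, depth + 1, max_depth),
        "right": build_tree(right, remaining, depth + 1, max_depth),
    }
```

With the 10 cat-and-dog examples encoded this way, calling build_tree(examples, ["ear_shape", "face_shape", "whiskers"]) returns a nested dictionary of internal nodes and leaves analogous to the tree built by hand in the video.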
At the root node, as well as on the left branch and the right branch of the decision tree, we had to decide: given the examples at that node, comprising a mix of cats and dogs, do you want to split on the ear shape feature, the face shape feature, or the whiskers feature? We'll see in the next video that decision trees choose which feature to split on in order to try to maximize purity. By purity, I mean you want to get to subsets that are as close as possible to all cats or all dogs.

For example, if we had a feature that said, does this animal have cat DNA? (we don't actually have this feature, but if we did), we could have split on it at the root node, which would have resulted in 5 out of 5 cats in the left branch and 0 out of 5 cats in the right branch. Both the left and right subsets of the data would be completely pure, meaning there's only one class, either cats only or not cats only, in each sub-branch, which is why the cat DNA feature, if we had it, would have been a great feature to use.

But with the features we actually have, we had to decide whether to split on ear shape, which resulted in 4 out of 5 examples on the left being cats and 1 out of 5 examples on the right being cats; or face shape, which resulted in 4 out of 7 on the left and 1 out of 3 on the right being cats; or whiskers, which resulted in 3 out of 4 examples on the left and 2 out of 6 on the right being cats. The decision tree learning algorithm has to choose between ear shape, face shape, and whiskers: which of these features results in the greatest purity of the labels in the left and right sub-branches? Because if you can get to a highly pure subset of examples, you can either predict cat or predict not cat and get it mostly right. The next video on entropy will talk about how to estimate impurity and how to minimize it.
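As a rough preview of how that comparison might be scored, here is a small sketch using the split counts quoted above. The "weighted majority fraction" below (the size-weighted fraction of the majority class on each branch) is only an illustrative stand-in for the entropy-based criterion introduced in the next videos; the counts themselves come from the lecture's 10-example dataset.

```python
# Comparing the three candidate root splits by how pure their branches are.
# Counts are from the lecture's 10-example dataset; the scoring function is
# an illustrative stand-in for the entropy-based criterion covered next.

def weighted_majority_fraction(cats_left, n_left, cats_right, n_right):
    """Size-weighted fraction of the majority class across both branches.
    Higher means purer branches overall (1.0 means both branches are pure)."""
    def majority_fraction(cats, n):
        p = cats / n
        return max(p, 1 - p)
    n = n_left + n_right
    return (n_left * majority_fraction(cats_left, n_left)
            + n_right * majority_fraction(cats_right, n_right)) / n

candidate_splits = {
    "ear shape": (4, 5, 1, 5),   # 4/5 cats on the left, 1/5 cats on the right
    "face shape": (4, 7, 1, 3),  # 4/7 cats on the left, 1/3 cats on the right
    "whiskers": (3, 4, 2, 6),    # 3/4 cats on the left, 2/6 cats on the right
}

for name, counts in candidate_splits.items():
    print(f"{name}: {weighted_majority_fraction(*counts):.2f}")
# ear shape: 0.80, face shape: 0.60, whiskers: 0.70
```

Under this simple score, ear shape gives the purest branches, which matches the feature chosen at the root in the example above (entropy-based information gain happens to rank these three splits the same way here).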
So the first decision we have to make when learning a decision tree is how to choose which feature to split on at each node.

The second key decision you need to make when building a decision tree is when to stop splitting. The criterion we used just now was to keep splitting until a node is either 100% cats or 100% not cats, because at that point it seems natural to build a leaf node that just makes a classification prediction.

Alternatively, you might decide to stop splitting when splitting a node further would result in the tree exceeding a maximum depth, where the maximum depth you allow the tree to grow to is a parameter you can set. In a decision tree, the depth of a node is defined as the number of hops it takes to get from the root node, that is, the node at the very top, to that particular node. So the root node takes 0 hops to get to itself and is at depth 0, the nodes below it are at depth 1, and the nodes below those are at depth 2. If you decided that the maximum depth of the decision tree is, say, 2, then you would not split any nodes below that level, so the tree never gets to depth 3. One reason you might want to limit the depth of the decision tree is, first, to make sure the tree doesn't get too big and unwieldy, and second, because keeping the tree small makes it less prone to overfitting.

Another criterion you might use to decide to stop splitting is when the improvement in the purity score, which you'll see in a later video, is below a certain threshold. That is, if splitting a node gives only a minimal improvement to purity, or, as you'll see later, only a minimal decrease in impurity, and the gains are too small, then you might not bother, again both to keep the tree smaller and to reduce the risk of overfitting.

Finally, if the number of examples at a node is below a certain threshold, you might also decide to stop splitting. For example, if at the root node we had split on the face shape feature, then the right branch would have had just three training examples, with one cat and two dogs. Rather than splitting this into even smaller subsets, if you decided not to split sets of three or fewer examples any further, you would just create a leaf node. Because these examples are mainly dogs, two out of three, this node would make a prediction of not cat. Again, one reason you might decide this is not worth splitting on is to keep the tree smaller and to avoid overfitting.

When I look at decision tree learning algorithms myself, sometimes I feel like, boy, there are a lot of different pieces, a lot of different things going on in this algorithm. Part of the reason it might feel that way is that, over the evolution of decision trees, one researcher proposed a basic version of the algorithm, then a different researcher said, oh, we can modify it this way, such as with a new criterion for splitting, and then another researcher came up with a different idea, like, maybe we should stop splitting when the tree reaches a certain maximum depth. Over the years, different researchers came up with different refinements to the algorithm. As a result, it works really well, but when you look at all the details of how to implement a decision tree, it can feel like a lot of different pieces, such as the many different ways of deciding when to stop splitting.

If it feels like a somewhat complicated, messy algorithm to you, it does to me too, but these different pieces do fit together into a very effective learning algorithm. What you learn in this course are the key, most important ideas for making it work well. At the end of this week, I'll also share some guidance, some suggestions for how to use open source packages, so that you don't need too complicated a procedure for making all of these decisions, like how to decide when to stop splitting, and you can really get these algorithms to work well for yourself. So even though this algorithm may seem complicated and messy, and frankly it does to me too, it does work well.

Now, the next key decision I want to dive more deeply into is how you decide how to split at a node. In the next video, let's take a look at the definition of entropy, which will be a way for us to measure purity, or more precisely impurity, in a node. Let's go on to the next video.
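On the point about open source packages: the lecture doesn't name a specific library, but as one common choice, here is a minimal, hypothetical scikit-learn sketch showing how the stopping criteria discussed above map onto hyperparameters of DecisionTreeClassifier. The tiny feature matrix is made up purely for illustration and is not the course's dataset.

```python
# The stopping criteria discussed above correspond to hyperparameters of an
# off-the-shelf implementation; scikit-learn is one common open-source choice
# (the lecture does not name a specific package). The toy data is made up.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Columns: ear shape (1 = pointy), face shape (1 = round), whiskers (1 = present)
X = np.array([
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
])
y = np.array([1, 1, 0, 0])  # 1 = cat, 0 = not cat

clf = DecisionTreeClassifier(
    criterion="entropy",         # impurity measure (the next video's topic)
    max_depth=2,                 # stop when the tree reaches this depth
    min_samples_split=3,         # don't split nodes with fewer examples
    min_impurity_decrease=0.01,  # don't split if the purity gain is too small
)
clf.fit(X, y)
print(clf.predict([[1, 1, 0]]))  # -> [1]: pointy ears, so this toy tree says cat
```

Here max_depth, min_samples_split, and min_impurity_decrease correspond to the maximum-depth, minimum-number-of-examples, and minimum-purity-improvement stopping criteria from the video.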