When building a decision tree, the way we'll decide what feature to split on at a node is based on which choice of feature reduces entropy the most. Reduces entropy, or reduces impurity, or maximizes purity. In decision tree learning, the reduction of entropy is called information gain. Let's take a look in this video at how to compute information gain, and therefore how to choose what feature to split on at each node in your decision tree.

Let's use the example of deciding what feature to use at the root node of the decision tree we were building just now for recognizing cats versus not cats. If we had split using the ear shape feature at the root node, this is what we would have gotten: five examples on the left and five on the right. On the left, four out of five would be cats, so p1 would be equal to four-fifths, or 0.8. On the right, one out of five are cats, so p1 is equal to one-fifth, or 0.2. If you apply the entropy formula from the last video to the left subset of data and the right subset of data, we find that the degree of impurity on the left is the entropy of 0.8, which is about 0.72, and on the right, the entropy of 0.2 turns out also to be about 0.72. So these would be the entropies at the left and right sub-branches if we were to split on the ear shape feature.

One other option would be to split on the face shape feature. If we had done so, then on the left, four of the seven examples would be cats, so p1 is four-sevenths, and on the right, one-third are cats, so p1 on the right is one-third. The entropy of four-sevenths and the entropy of one-third are about 0.99 and 0.92, so the degree of impurity in the left and right nodes is much higher: 0.99 and 0.92, compared to 0.72 and 0.72. Finally, the third possible choice of feature to use at the root node would be the whiskers feature, in which case you split based on whether whiskers are present or absent. In this case, p1 on the left is three-quarters, p1 on the right is two-sixths, and the corresponding entropy values work out to about 0.81 and 0.92.

So the key question we need to answer is: given these three options for a feature to use at the root node, which one do we think works best? It turns out that rather than looking at these entropy numbers and comparing them directly, it is useful to take a weighted average of them, and here's what I mean. If there's a node with a lot of examples in it and high entropy, that seems worse than a node with just a few examples in it and high entropy, because entropy, as a measure of impurity, is worse if you have a very large and impure dataset, compared to just a few examples in a branch of the tree that is very impure.

So the key decision is: of these three possible choices of features to use at the root node, which one do we want to use? Associated with each of these splits are two numbers: the entropy on the left sub-branch and the entropy on the right sub-branch. In order to pick from these, we'd like to combine these two numbers into a single number, so we can just pick, of these three choices, which one works best. The way we're going to combine these two numbers is by taking a weighted average, because how important it is to have low entropy in, say, the left or right sub-branch also depends on how many examples went into the left or right sub-branch. If there are a lot of examples in, say, the left sub-branch, then it seems more important to make sure that the left sub-branch's entropy value is low.
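If you'd like to check these entropy numbers yourself, here's a minimal sketch in Python (not part of the lecture slides); it just applies the binary entropy formula from the last video to each sub-branch's p1 value:

```python
import numpy as np

def entropy(p1):
    # Binary entropy of a node whose fraction of positive (cat) examples is p1.
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

# Entropies of the left/right sub-branches for each candidate root split
print(entropy(4/5), entropy(1/5))   # ear shape:  ~0.72 and ~0.72
print(entropy(4/7), entropy(1/3))   # face shape: ~0.99 and ~0.92
print(entropy(3/4), entropy(1/3))   # whiskers:   ~0.81 and ~0.92
```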
So in this example, 5 of the 10 examples went to the left sub-branch, so we can compute the weighted average as 5/10 times the entropy of 0.8, plus, since 5 of the 10 examples also went to the right sub-branch, 5/10 times the entropy of 0.2. For the example in the middle, the left sub-branch received 7 out of 10 examples, so we compute 7/10 times the entropy of 0.57, and the right sub-branch received 3 out of 10 examples, so we add 3/10 times the entropy of one-third, which is about 0.33. And finally, on the right, we compute 4/10 times the entropy of 0.75, plus 6/10 times the entropy of 0.33. The way we would choose a split is by computing these three numbers and picking whichever one is lowest, because that gives us the left and right sub-branches with the lowest average weighted entropy.

In the way that decision trees are built, we're actually going to make one more change to these formulas, to stick to the convention in decision tree building, but it won't change the outcome: rather than computing this weighted average entropy, we're going to compute the reduction in entropy compared to if we hadn't split at all. Remember that at the root node we had started off with all 10 examples, with 5 cats and 5 dogs, so at the root node p1 equals 5/10, or 0.5, and the entropy of the root node, the entropy of 0.5, was actually equal to 1. This was maximum impurity, because it was 5 cats and 5 dogs. So the formula that we're actually going to use for choosing a split is not this weighted entropy at the left and right sub-branches; instead, it's going to be the entropy at the root node, which is the entropy of 0.5, minus this formula. In this example, if you work out the math, it turns out to be 0.28. For the face shape example, we again compute the entropy at the root node, the entropy of 0.5, minus this, which turns out to be 0.03, and for whiskers we compute that as well, which turns out to be 0.12.

These numbers we just calculated, 0.28, 0.03, and 0.12, are called the information gain, and what information gain measures is the reduction in entropy that you get in your tree as a result of making a split. The entropy was originally 1 at the root node, and by making the split you end up with a lower value of entropy; the difference between those two values is the reduction in entropy, and that's 0.28 in the case of splitting on the ear shape.

So why do we bother to compute the reduction in entropy rather than just the entropy at the left and right sub-branches? It turns out that one of the stopping criteria for deciding when not to split any further is if the reduction in entropy is too small, in which case you could decide you're just increasing the size of the tree unnecessarily and risking overfitting, and just decide not to bother splitting if the reduction in entropy is below a threshold. In this example, splitting on ear shape results in the biggest reduction in entropy: 0.28 is bigger than 0.03 or 0.12, and so we would choose to split on the ear shape feature at the root node. On the next slide, let's give a more formal definition of information gain.
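As a quick sanity check of these information gain numbers, here's a small Python sketch (again, not from the lecture) that plugs the weights and p1 values from this example into the entropy formula; the function name information_gain is just a convenient label:

```python
import numpy as np

def entropy(p1):
    # Binary entropy, as in the previous video.
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def information_gain(p1_root, w_left, p1_left, w_right, p1_right):
    # Entropy at the root minus the weighted average entropy of the two sub-branches.
    return entropy(p1_root) - (w_left * entropy(p1_left) + w_right * entropy(p1_right))

# Root node: 5 cats out of 10 examples, so p1_root = 0.5 and entropy(0.5) = 1
print(information_gain(0.5, 5/10, 4/5, 5/10, 1/5))  # ear shape:  ~0.28
print(information_gain(0.5, 7/10, 4/7, 3/10, 1/3))  # face shape: ~0.03
print(information_gain(0.5, 4/10, 3/4, 6/10, 1/3))  # whiskers:   ~0.12
```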
And by the way, one additional piece of notation that we'll introduce on the next slide: these numbers, 5/10 and 5/10, I'm going to call w_left, because that's the fraction of examples that went to the left branch, and w_right, because that's the fraction of examples that went to the right branch; whereas for this other example, w_left would be 7/10 and w_right would be 3/10.

So let's now write down the general formula for how to compute information gain. Using the example of splitting on the ear shape feature, let me define p1_left to be the fraction of examples in the left subtree that have a positive label, that is, that are cats. In this example, p1_left would be equal to 4/5. Let me also define w_left to be the fraction of examples, out of all the examples at the root node, that went to the left sub-branch, so in this example w_left would be 5/10. Similarly, let's define p1_right to be, of all the examples in the right branch, the fraction that are positive examples; with one out of five of those examples being cats, that would be 1/5. Similarly, w_right is 5/10, the fraction of examples that went to the right sub-branch. And let's also define p1_root to be the fraction of examples that are positive at the root node, which in this case would be 5/10, or 0.5.

Information gain is then defined as the entropy of p1_root, that is, the entropy at the root node, minus that weighted entropy calculation we had on the previous slide: minus the quantity w_left, which would be 5/10 in this example, times the entropy applied to p1_left, that's the entropy on the left sub-branch, plus w_right, the fraction of examples that went to the right branch, times the entropy of p1_right. With this definition of information gain, you can calculate the information gain associated with choosing any particular feature to split on at a node, and then, out of all the possible features you could choose to split on, pick the one that gives you the highest information gain. That will, hopefully, result in choosing a feature to split on that increases the purity of the subsets of data you get on both the left and right sub-branches of your decision tree.

Now that you know how to calculate information gain, or the reduction in entropy, you know how to pick a feature to split on at a node. Let's put all the things we've talked about together into the overall algorithm for building a decision tree, given a training set. Let's go see that in the next video.
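As an aside, to make the general definition concrete, here's a rough Python sketch of the same calculation applied directly to a small binary feature matrix. The helper names and the 10-example dataset are illustrative, constructed so that its splits match the numbers used in this lecture (4/5 cats on the pointy-ear branch, 4/7 on the round-face branch, 3/4 on the whiskers branch):

```python
import numpy as np

def entropy(p1):
    # Binary entropy of a node whose fraction of positive (cat) examples is p1.
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def information_gain(X, y, feature):
    # Information gain from splitting the examples (X, y) on one binary feature column.
    left = X[:, feature] == 1           # examples where the feature value is 1 (e.g. pointy ears)
    right = ~left
    w_left, w_right = left.mean(), right.mean()
    p1_left = y[left].mean() if left.any() else 0.0
    p1_right = y[right].mean() if right.any() else 0.0
    p1_root = y.mean()
    return entropy(p1_root) - (w_left * entropy(p1_left) + w_right * entropy(p1_right))

# Illustrative 10-example dataset, chosen to reproduce the splits in this lecture.
# Columns: [pointy ears, round face, whiskers present]; y = 1 means cat.
X = np.array([[1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1],
              [1, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0]])
y = np.array([1, 1, 0, 0, 1, 1, 0, 1, 0, 0])

gains = [information_gain(X, y, f) for f in range(X.shape[1])]
print(gains)                    # ~[0.28, 0.03, 0.12]
print(int(np.argmax(gains)))    # 0 -> split on ear shape, the feature with the highest gain
```

Running this picks the ear shape feature, since its information gain of about 0.28 is the largest of the three, which matches the choice made at the root node in this example.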