Other than convolutional layers, ConvNets often also use pooling layers to reduce the size of the representation, to speed up computation, and to make some of the features they detect a bit more robust. Let's go through an example of pooling, and then we'll talk about why you might want to do this.

Suppose you have a 4x4 input, and you want to apply a type of pooling called max pooling. The output of this particular implementation of max pooling will be a 2x2 output, and the way you compute it is quite simple. Take your 4x4 input and break it into four regions, which I'm going to color as follows. Then each element of the 2x2 output is just the max of the correspondingly shaded region. So in the upper left, the max of those four numbers is 9. In the upper right, the max of the blue numbers is 2. In the lower left, the biggest number is 6, and in the lower right, the biggest number is 3. So to compute each of the numbers on the right, we took the max over a 2x2 region. This is as if you're applying a filter of size 2, because you're taking 2x2 regions, with a stride of 2, and these are the hyperparameters of max pooling. You start with the first 2x2 region, which gives you the 9, then step over by 2 to the next region, which gives you the 2. For the next row, you step down by 2 to get the 6, and then step to the right by 2 to get the 3. Because the regions are 2x2, f is equal to 2, and because you stride by 2, s is equal to 2.

Here's the intuition behind what max pooling is doing. If you think of this 4x4 region as some set of features, the activations of some layer of the neural network, then a large number means that it has maybe detected a particular feature. So the upper left-hand quadrant has this particular feature, maybe a vertical edge, or maybe an eye, or a whisker if you're trying to detect a cat. Whereas that same feature, say a cat-eye detector, doesn't really exist in the upper right-hand quadrant. What the max operation does is say: so long as the feature is detected anywhere in one of these quadrants, it remains preserved in the output of max pooling. If the feature is detected anywhere in this filter, keep a high number; if it is not detected, as in the upper right-hand quadrant, then the max of all those numbers is still itself quite small. So maybe that's the intuition behind max pooling. But I have to admit, I think the main reason people use max pooling is that it's been found in a lot of experiments to work well, and the intuition I just described, despite being often cited, may not be the real underlying reason that max pooling works well in ConvNets.

One interesting property of max pooling is that it has a set of hyperparameters, but it has no parameters to learn. There's actually nothing for gradient descent to learn: once you fix f and s, it's just a fixed computation, and gradient descent doesn't change anything.
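To make the f = 2, s = 2 example concrete, here is a minimal NumPy sketch of 2D max pooling. This is not code from the lecture, and the specific 4x4 values below are just one grid that is consistent with the maxes quoted above (9, 2, 6, 3):

```python
import numpy as np

def max_pool_2d(x, f=2, s=2):
    """2D max pooling with filter size f and stride s, no padding."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Max over the f x f window starting at (i*s, j*s).
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

# A 4x4 input broken into four 2x2 regions; f = 2, s = 2 gives a 2x2 output.
x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]])
print(max_pool_2d(x))   # [[9. 2.]
                        #  [6. 3.]]
```

Notice there is nothing learnable here: once f and s are fixed, the function is fully determined.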
Let's go through an example with some different hyperparameters. Here you have a 5x5 input, and we're going to apply max pooling with a filter size that's 3x3, so f is equal to 3, and let's use a stride of 1. In this case, the output size is going to be 3x3. The formulas we developed in the previous videos for figuring out the output size of a conv layer also work for max pooling: that's (n + 2p - f)/s + 1, rounded down.

In this example, let's compute each of the elements of this 3x3 output. For the upper left-hand element, we look over that region; notice it's a 3x3 region, because the filter size is 3, and taking the max there gives 9. Then we shift it over by 1, because we're taking a stride of 1, and the max in the blue box is 9. Shift it over again, and the max in the blue box is 5. Then let's go on to the next row; with a stride of 1, we're just stepping down by one step. The max in that region is 9, the max in the next region is 9, and in the next region there are two 5s, but the max is still 5. And then finally, for the last row, the maxes are 8, 6, and 9 in the lower right-hand corner. So this set of hyperparameters, f equals 3, s equals 1, gives the output shown here.

So far, I've shown max pooling on a 2D input. If you have a 3D input, the output will be 3D as well, with the same number of channels. For example, if you have 5x5x2, the output will be 3x3x2, and the way you compute max pooling is to perform the computation we just described on each of the channels independently. So the first channel, shown here on top, is computed the same way, and for the second channel, the one I just drew at the bottom, you do the same computation on that slice of the volume, which gives you the second output slice. More generally, if this were 5x5 by some number of channels, the output would be 3x3 by that same number of channels, with the max pooling computation done independently on each of the n_C channels.

There's one other type of pooling that isn't used very often, but I'll mention it briefly: average pooling. It's pretty much what you'd expect: instead of taking the max within each filter, you take the average. In this example, the average of the numbers in purple is 3.75, then there's 1.25, and 4, and 2. So this is average pooling with hyperparameters f equals 2, s equals 2, and you can choose other hyperparameters as well. These days, max pooling is used much more often than average pooling, with one exception: sometimes, very deep in a neural network, you might use average pooling to collapse your representation from, say, 7x7x1000, averaging over the spatial extent, down to 1x1x1000. We'll see an example of this later.

So just to summarize, the hyperparameters for pooling are f, the filter size, and s, the stride. A common choice is f equals 2, s equals 2, which has the effect of shrinking the height and width of the representation by a factor of 2; I've also seen f equals 3, s equals 2 used. And then the other hyperparameter is just a binary bit that says whether you're using max pooling or average pooling.
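As a rough sketch of how this extends to volumes (again, not the lecture's code; pool_volume is a hypothetical helper), the same window-and-reduce loop is simply run on each channel independently, with np.max or np.mean selecting between max and average pooling:

```python
import numpy as np

def pool_volume(x, f, s, mode="max"):
    """Pool an n_H x n_W x n_C volume channel by channel, no padding.

    Output height and width follow floor((n - f) / s) + 1; n_C is unchanged.
    """
    n_h, n_w, n_c = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w, n_c))
    reduce_fn = np.max if mode == "max" else np.mean
    for c in range(n_c):                     # each channel pooled independently
        for i in range(out_h):
            for j in range(out_w):
                window = x[i * s:i * s + f, j * s:j * s + f, c]
                out[i, j, c] = reduce_fn(window)
    return out

x = np.random.rand(5, 5, 2)
print(pool_volume(x, f=3, s=1).shape)                  # (3, 3, 2)
print(pool_volume(x, f=2, s=2, mode="average").shape)  # (2, 2, 2)
```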
If you want, you can add an extra hyperparameter for the padding, although this is very rarely used. When you do max pooling, you usually do not use any padding, although there is one exception that we'll see next week. But for the most part, max pooling does not use padding, so the most common value of p by far is p equals 0. The input to max pooling is a volume of size n_H by n_W by n_C, and, assuming no padding, it outputs a volume of size (n_H - f)/s + 1, rounded down, by (n_W - f)/s + 1, rounded down, by n_C. The number of input channels equals the number of output channels because pooling is applied to each of your channels independently.

One thing to note about pooling is that there are no parameters to learn. When you implement backprop, you find there are no parameters for backprop to adapt through max pooling. There are just these hyperparameters that you set once, maybe by hand or using cross-validation, and beyond that, you're done. It's just a fixed function that the neural network computes in one of its layers, and there is actually nothing to learn.

So that's it for pooling. You now know how to build convolutional layers and pooling layers. In the next video, let's see a more complex example of a ConvNet, one that will also allow us to introduce fully connected layers.
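As a final supplementary sketch (not from the lecture; pool_output_shape is a hypothetical helper), the output-volume rule above can be checked directly, with integer floor division implementing the rounding down:

```python
def pool_output_shape(n_h, n_w, n_c, f, s, p=0):
    """Pooling output size: floor((n + 2p - f)/s) + 1 along height and width.

    The channel count n_c passes through unchanged, and the layer itself has
    zero learnable parameters, only the hyperparameters f, s (and rarely p).
    """
    out_h = (n_h + 2 * p - f) // s + 1
    out_w = (n_w + 2 * p - f) // s + 1
    return out_h, out_w, n_c

print(pool_output_shape(5, 5, 2, f=3, s=1))     # (3, 3, 2)
print(pool_output_shape(7, 7, 1000, f=7, s=7))  # (1, 1, 1000), one way to "collapse" spatially
```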