Other than convolutional layers, ConvNets often also use pooling layers to reduce the size of the representation, to speed up computation, and to make some of the features they detect a bit more robust. Let's go through an example of pooling, and then we'll talk about why you might want to do this.

Suppose you have a 4x4 input, and you want to apply a type of pooling called max pooling. The output of this particular implementation of max pooling will be a 2x2 output, and the way you compute it is quite simple. Take your 4x4 input and break it into four regions, which I'm going to color as follows. Then each element of the 2x2 output is just the max of the correspondingly shaded region. So in the upper left, the max of those four numbers is 9. In the upper right, the max of the blue numbers is 2. In the lower left, the biggest number is 6, and in the lower right, the biggest number is 3. So to compute each of the numbers on the right, we took the max over a 2x2 region. This is as if you're applying a filter of size 2, because you're taking 2x2 regions, with a stride of 2, and these are the hyperparameters of max pooling. You start with the first 2x2 region, which gives you the 9, then step over by 2 to the next region, which gives you the 2. For the next row, you step down by 2 to get the 6, and then step to the right by 2 to get the 3. Because the regions are 2x2, f is equal to 2, and because you stride by 2, s is equal to 2.

Here's the intuition behind what max pooling is doing. If you think of this 4x4 region as some set of features, the activations of some layer of the neural network, then a large number means that it has maybe detected a particular feature. So the upper left-hand quadrant has this particular feature, maybe a vertical edge, or maybe an eye, or a whisker if you're trying to detect a cat. Whereas that same feature, say a cat-eye detector, doesn't really exist in the upper right-hand quadrant. What the max operation does is say: so long as the feature is detected anywhere in one of these quadrants, it remains preserved in the output of max pooling. If the feature is detected anywhere in this filter, keep a high number; if it is not detected, as in the upper right-hand quadrant, then the max of all those numbers is still itself quite small. So maybe that's the intuition behind max pooling. But I have to admit, I think the main reason people use max pooling is that it's been found in a lot of experiments to work well, and the intuition I just described, despite being often cited, may not be the real underlying reason that max pooling works well in ConvNets.

One interesting property of max pooling is that it has a set of hyperparameters, but it has no parameters to learn. There's actually nothing for gradient descent to learn: once you fix f and s, it's just a fixed computation, and gradient descent doesn't change anything.
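To make the f = 2, s = 2 example concrete, here is a minimal NumPy sketch of 2D max pooling. This is not code from the lecture, and the specific 4x4 values below are just one grid that is consistent with the maxes quoted above (9, 2, 6, 3):

```python
import numpy as np

def max_pool_2d(x, f=2, s=2):
    """2D max pooling with filter size f and stride s, no padding."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Max over the f x f window starting at (i*s, j*s).
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

# A 4x4 input broken into four 2x2 regions; f = 2, s = 2 gives a 2x2 output.
x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]])
print(max_pool_2d(x))   # [[9. 2.]
                        #  [6. 3.]]
```

Notice there is nothing learnable here: once f and s are fixed, the function is fully determined.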
Let's go through an example with some different hyperparameters. Here you have a 5x5 input, and we're going to apply max pooling with a filter size that's 3x3, so f is equal to 3, and let's use a stride of 1. In this case, the output size is going to be 3x3. The formulas we developed in the previous videos for figuring out the output size of a conv layer also work for max pooling: that's (n + 2p - f)/s + 1, rounded down.

In this example, let's compute each of the elements of this 3x3 output. For the upper left-hand element, we look over that region; notice it's a 3x3 region, because the filter size is 3, and taking the max there gives 9. Then we shift it over by 1, because we're taking a stride of 1, and the max in the blue box is 9. Shift it over again, and the max in the blue box is 5. Then let's go on to the next row; with a stride of 1, we're just stepping down by one step. The max in that region is 9, the max in the next region is 9, and in the next region there are two 5s, but the max is still 5. And then finally, for the last row, the maxes are 8, 6, and 9 in the lower right-hand corner. So this set of hyperparameters, f equals 3, s equals 1, gives the output shown here.

So far, I've shown max pooling on a 2D input. If you have a 3D input, the output will be 3D as well, with the same number of channels. For example, if you have 5x5x2, the output will be 3x3x2, and the way you compute max pooling is to perform the computation we just described on each of the channels independently. So the first channel, shown here on top, is computed the same way, and for the second channel, the one I just drew at the bottom, you do the same computation on that slice of the volume, which gives you the second output slice. More generally, if this were 5x5 by some number of channels, the output would be 3x3 by that same number of channels, with the max pooling computation done independently on each of the n_C channels.

There's one other type of pooling that isn't used very often, but I'll mention it briefly: average pooling. It's pretty much what you'd expect: instead of taking the max within each filter, you take the average. In this example, the average of the numbers in purple is 3.75, then there's 1.25, and 4, and 2. So this is average pooling with hyperparameters f equals 2, s equals 2, and you can choose other hyperparameters as well. These days, max pooling is used much more often than average pooling, with one exception: sometimes, very deep in a neural network, you might use average pooling to collapse your representation from, say, 7x7x1000, averaging over the spatial extent, down to 1x1x1000. We'll see an example of this later.

So just to summarize, the hyperparameters for pooling are f, the filter size, and s, the stride. A common choice is f equals 2, s equals 2, which has the effect of shrinking the height and width of the representation by a factor of 2; I've also seen f equals 3, s equals 2 used. And then the other hyperparameter is just a binary bit that says whether you're using max pooling or average pooling.
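As a rough sketch of how this extends to volumes (again, not the lecture's code; pool_volume is a hypothetical helper), the same window-and-reduce loop is simply run on each channel independently, with np.max or np.mean selecting between max and average pooling:

```python
import numpy as np

def pool_volume(x, f, s, mode="max"):
    """Pool an n_H x n_W x n_C volume channel by channel, no padding.

    Output height and width follow floor((n - f) / s) + 1; n_C is unchanged.
    """
    n_h, n_w, n_c = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w, n_c))
    reduce_fn = np.max if mode == "max" else np.mean
    for c in range(n_c):                     # each channel pooled independently
        for i in range(out_h):
            for j in range(out_w):
                window = x[i * s:i * s + f, j * s:j * s + f, c]
                out[i, j, c] = reduce_fn(window)
    return out

x = np.random.rand(5, 5, 2)
print(pool_volume(x, f=3, s=1).shape)                  # (3, 3, 2)
print(pool_volume(x, f=2, s=2, mode="average").shape)  # (2, 2, 2)
```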
If you want, you can add an extra hyperparameter for the padding, although this is very rarely used. When you do max pooling, you usually do not use any padding, although there is one exception that we'll see next week. But for the most part, max pooling does not use padding, so the most common value of p by far is p equals 0. The input to max pooling is a volume of size n_H by n_W by n_C, and, assuming no padding, it outputs a volume of size (n_H - f)/s + 1, rounded down, by (n_W - f)/s + 1, rounded down, by n_C. The number of input channels equals the number of output channels because pooling is applied to each of your channels independently.

One thing to note about pooling is that there are no parameters to learn. When you implement backprop, you find there are no parameters for backprop to adapt through max pooling. There are just these hyperparameters that you set once, maybe by hand or using cross-validation, and beyond that, you're done. It's just a fixed function that the neural network computes in one of its layers, and there is actually nothing to learn.

So that's it for pooling. You now know how to build convolutional layers and pooling layers. In the next video, let's see a more complex example of a ConvNet, one that will also allow us to introduce fully connected layers.
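As a final supplementary sketch (not from the lecture; pool_output_shape is a hypothetical helper), the output-volume rule above can be checked directly, with integer floor division implementing the rounding down:

```python
def pool_output_shape(n_h, n_w, n_c, f, s, p=0):
    """Pooling output size: floor((n + 2p - f)/s) + 1 along height and width.

    The channel count n_c passes through unchanged, and the layer itself has
    zero learnable parameters, only the hyperparameters f, s (and rarely p).
    """
    out_h = (n_h + 2 * p - f) // s + 1
    out_w = (n_w + 2 * p - f) // s + 1
    return out_h, out_w, n_c

print(pool_output_shape(5, 5, 2, f=3, s=1))     # (3, 3, 2)
print(pool_output_shape(7, 7, 1000, f=7, s=7))  # (1, 1, 1000), one way to "collapse" spatially
```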