You now know pretty much all the building blocks of a full convolutional neural network, so let's look at an example. Let's say you're inputting an image which is 32 by 32 by 3, so it's an RGB image, and maybe you're trying to do handwritten digit recognition. So you have a number, like 7, in a 32 by 32 RGB image, and you're trying to recognize which one of the 10 digits from 0 to 9 this is. Let's build a neural network to do this. What I'm going to use in this slide is inspired by, and actually quite similar to, one of the classic neural networks called LeNet-5, which was created by Yann LeCun many years ago. What I'll show here isn't exactly LeNet-5, but many of the parameter choices were inspired by it.

So with a 32 by 32 by 3 input, let's say that the first layer uses a 5 by 5 filter, a stride of 1, and no padding. When I don't write a padding, it means the padding is equal to 0. So the output of this layer, if you use 6 filters, would be 28 by 28 by 6, and I'm going to call this layer Conv1. So you apply 6 filters, add a bias, apply the nonlinearity, maybe a ReLU nonlinearity, and that's the Conv1 output. Next, let's apply a pooling layer. I'm going to apply max pooling here with a 2 by 2 filter and a stride of 2, so f equals 2, s equals 2. This should reduce the height and width of the representation by a factor of 2, so 28 by 28 now becomes 14 by 14, and the number of channels remains the same, so 14 by 14 by 6. We're going to call this the Pool1 output.

Now, it turns out that in the ConvNet literature there are two slightly inconsistent conventions about what you call a layer. One convention is that the conv layer and the pooling layer together are called one layer, so this would be layer 1 of the neural network. Another convention is to count the conv layer as a layer and the pool layer as a separate layer. When people report the number of layers in a neural network, usually they report just the number of layers that have weights, that have parameters. And because the pooling layer has no weights, no parameters, only a few hyperparameters, I'm going to use the convention that Conv1 and Pool1 together are treated as layer 1. Although sometimes, if you read articles online or research papers, you'll see the conv layer and the pooling layer described as two separate layers, when I count layers I'm just going to count layers that have weights. The 1 at the end of the names Conv1 and Pool1 also refers to the fact that I view both of these as part of layer 1 of the neural network; Pool1 is grouped into layer 1 because it doesn't have its own weights.
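By the way, if you want to check these dimensions yourself, the 28 by 28 and the 14 by 14, the output size of a conv or pooling layer is floor((n + 2p - f)/s) + 1. Here's a minimal sketch of that arithmetic (this code isn't from the lecture, and the helper name is just one I've chosen):

```python
def conv_output_size(n, f, s=1, p=0):
    """Output height/width of a conv or pooling layer:
    floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(32, f=5, s=1, p=0))  # Conv1: 5x5 filter, stride 1, no padding -> 28
print(conv_output_size(28, f=2, s=2))       # Pool1: 2x2 max pooling, stride 2 -> 14
```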
Next, given a 14 by 14 by 6 volume, let's apply another convolutional layer to it. I'm going to use a 5 by 5 filter, so f equals 5, a stride of 1, and again no padding. And let's use 16 filters this time, so this would be a 10 by 10 by 16 dimensional output. This is the Conv2 layer. Then let's apply max pooling to this with f equals 2, s equals 2. You can probably guess the result: this should halve the height and width, so you end up with a 5 by 5 by 16 volume, the same number of channels as before. I'm going to call this Pool2, and in our convention this is layer 2, because it has one set of weights, in the Conv2 layer.

Now, 5 times 5 times 16 is equal to 400, so let's flatten Pool2 out into a 400 by 1 dimensional vector. Think of this as flattening it out into just a set of 400 neurons. What we're going to do next is take these 400 units and build the next layer with 120 units. This is actually our first fully connected layer, and I'm going to call it FC3, because we have 400 units densely connected to 120 units. This fully connected layer is just like the single neural network layer you saw in courses 1 and 2: a standard neural network layer with a weight matrix, let's call it W3, of dimension 120 by 400. It's called fully connected because each of the 400 units here is connected to each of the 120 units here. You'd also have a bias parameter that's just 120 dimensional, because you have 120 outputs. Then lastly, let's take the 120 units and add another layer, this time a little bit smaller, with, let's say, 84 units; we'll call this fully connected layer FC4. Finally, you now have 84 real numbers that you can feed to a softmax unit. And if you're trying to do handwritten digit recognition, to recognize whether it's a handwritten 0, 1, 2, and so on up to 9, this would be a softmax with 10 outputs.

So this is a reasonably typical example of what a convolutional neural network might look like. I know this seems like a lot of hyperparameters; we'll give you some more specific suggestions later for how to choose them. One common guideline is to not try to invent your own settings of hyperparameters, but to look in the literature to see what hyperparameters have worked for others, and to choose an architecture that has worked well for someone else, since there's a chance it will work for your application as well. We'll say more about that next week. For now, I just want to point out that as you go deeper in the neural network, usually nH and nW, the height and width, will decrease: it goes from 32 by 32 to 28 by 28 to 14 by 14 to 10 by 10 to 5 by 5. So as you go deeper, usually the height and width will decrease, whereas the number of channels will increase: it's gone from 3 to 6 to 16. And then you have fully connected layers at the end. Another pretty common pattern you see in neural networks is to have one or more conv layers followed by a pooling layer, then one or more conv layers followed by a pooling layer, then a few fully connected layers at the end, followed by maybe a softmax.
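To make the whole architecture concrete, here's roughly how you might write it down in code. The lecture doesn't show any code, so this is just a sketch in Keras, with layer names chosen to match the discussion; ReLU on the fully connected layers is my assumption, since the lecture doesn't specify their activation:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# LeNet-5-inspired network from the lecture (a sketch, not the original LeNet-5)
model = models.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),                  # 32x32 RGB input
    layers.Conv2D(6, 5, strides=1, padding="valid",
                  activation="relu", name="conv1"),     # -> 28x28x6
    layers.MaxPooling2D(pool_size=2, strides=2,
                        name="pool1"),                  # -> 14x14x6
    layers.Conv2D(16, 5, strides=1, padding="valid",
                  activation="relu", name="conv2"),     # -> 10x10x16
    layers.MaxPooling2D(pool_size=2, strides=2,
                        name="pool2"),                  # -> 5x5x16
    layers.Flatten(),                                   # -> 400
    layers.Dense(120, activation="relu", name="fc3"),
    layers.Dense(84, activation="relu", name="fc4"),
    layers.Dense(10, activation="softmax"),             # 10 digit classes
])
model.summary()  # prints the activation shapes and parameter counts per layer
```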
So let's just go through, for this neural network, some more details: the activation shapes, the activation sizes, and the number of parameters in this network. The input was 32 by 32 by 3, and if you multiply out those numbers you get 3,072. So the activation a0 has size 3,072, although its shape is really 32 by 32 by 3, and there are no parameters in the input layer. As you look at the different layers, feel free to work out the details yourself; these are the activation shapes and activation sizes of the different layers. I'll just point out a few things. First, notice that the pooling layers, the max pooling layers, don't have any parameters. Second, notice that the conv layers tend to have relatively few parameters, as we discussed in an earlier video. In fact, a lot of the parameters tend to be in the fully connected layers of the neural network (see the quick arithmetic sketch at the end of this transcript). You'll also notice that the activation size tends to go down gradually as you go deeper into the neural network; if it drops too quickly, that's usually not great for performance either. So it starts with 3,072 input values, goes to 4,704 after Conv1, then 1,176, then 1,600, then 400, then 120, and slowly falls to 84 until finally you have your softmax output. And you'll find that a lot of ConvNets have patterns similar to these.

So you've now seen the basic building blocks of convolutional neural networks: the conv layer, the pooling layer, and the fully connected layer. A lot of computer vision research has gone into figuring out how to put together these basic building blocks to build effective neural networks, and putting these things together actually requires quite a bit of insight. I think one of the best ways to gain intuition about how to put them together is to see a number of concrete examples of how others have done it. So what I want to do next week is show you a few concrete examples, beyond this first one you just saw, of how people have successfully put these things together to build very effective neural networks. Through those videos, I hope that will help you hone your own intuitions about how these things are built, as well as give you concrete examples of architectures that maybe you can use exactly as developed by someone else for your own application. We'll do that next week. But before wrapping up this week's videos, one last thing: in the next video I want to talk a little bit about why you might want to use convolutions, some of the benefits and advantages of using them, as well as how to put it all together, how to take a neural network like the one you just saw and actually train it on a training set to perform image recognition or some other task. So with that, let's go on to the last video of this week.
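As a footnote to the parameter discussion above (again, not part of the lecture, and computed from the 6-filter Conv1 and 16-filter Conv2 described in this transcript, so the numbers may differ from the slide shown in the video), here's the arithmetic behind the claim that most of the parameters live in the fully connected layers:

```python
# A conv layer has (f*f*channels_in + 1) * channels_out parameters,
# where the +1 is each filter's bias; pooling layers have none.
conv1   = (5 * 5 * 3 + 1) * 6     # 456
conv2   = (5 * 5 * 6 + 1) * 16    # 2,416
fc3     = 400 * 120 + 120         # 48,120  <- the bulk of the parameters
fc4     = 120 * 84 + 84           # 10,164
softmax = 84 * 10 + 10            # 850
print(conv1, conv2, fc3, fc4, softmax)
```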