When designing a layer for a ConvNet, you might have to pick: do you want a 1x1 filter, or a 3x3, or a 5x5, or do you want a pooling layer? What the Inception network says is, why not do them all? This makes the network architecture more complicated, but it also works remarkably well. Let's see how this works.

Let's say, for the sake of example, that your input is a 28x28x192 dimensional volume. What an Inception layer says is: instead of choosing what filter size you want in a conv layer, or even whether you want a convolutional layer or a pooling layer, let's do them all. You can use a 1x1 convolution, which will output a 28x28xsomething, let's say a 28x28x64 output, and you just have that volume there. But maybe you also want to try a 3x3, and that might output 28x28x128. What you then do is stack this second volume next to the first volume, and to make the dimensions match up, you make it a same convolution, so the output is still 28x28 in height and width, the same as the input, but with 128 channels. And maybe you want to hedge your bets: maybe a 5x5 filter works better, so let's do that too and have it output 28x28x32, again using a same convolution to keep the height and width the same. And maybe you don't want a convolutional layer at all, so let's apply pooling, which has some other output, and stack that up as well; here, pooling outputs 28x28x32.

Now, in order to make all the dimensions match, you actually need to use padding for max pooling. This is an unusual form of pooling, because if you want the input to have height and width 28x28 and the output to match the dimensions of everything else, also 28x28, then you need to use same padding as well as a stride of 1 for pooling. This detail might seem a bit funny to you now, but let's keep going and we'll make it all work later. With an inception module like this, you can input some volume and output, in this case, if you add up all these numbers, 64 plus 128 plus 32 plus 32, which equals 256. So you have one inception module that inputs 28x28x192 and outputs 28x28x256. This is the heart of the Inception network, which is due to Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. The basic idea is that instead of needing to pick one of these filter sizes or pooling and committing to that, you can do them all, concatenate all the outputs, and let the network learn whatever parameters it wants, using whatever combinations of these filter sizes it wants.
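To make the shapes concrete, here is a minimal sketch of the simplified inception module just described, written with PyTorch-style layers. The lecture itself doesn't use any particular framework, so the class and layer names are assumptions, and the extra 1x1 convolution after the pooling branch (to bring its 192 channels down to the 32 quoted above) is also an assumption for illustration.

```python
import torch
import torch.nn as nn

class NaiveInceptionModule(nn.Module):
    """Sketch of the simplified inception module: four parallel branches,
    all preserving the 28x28 height and width, concatenated along channels."""

    def __init__(self, in_channels=192):
        super().__init__()
        # "Same" convolutions: padding keeps the 28x28 spatial size.
        self.conv1x1 = nn.Conv2d(in_channels, 64, kernel_size=1)
        self.conv3x3 = nn.Conv2d(in_channels, 128, kernel_size=3, padding=1)
        self.conv5x5 = nn.Conv2d(in_channels, 32, kernel_size=5, padding=2)
        # Max pooling with stride 1 and padding so the output stays 28x28,
        # followed by a 1x1 conv to reduce the channels to 32 (assumption).
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.pool_proj = nn.Conv2d(in_channels, 32, kernel_size=1)

    def forward(self, x):
        branches = [
            self.conv1x1(x),
            self.conv3x3(x),
            self.conv5x5(x),
            self.pool_proj(self.pool(x)),
        ]
        # Stack the four volumes along the channel dimension:
        # 64 + 128 + 32 + 32 = 256, so 28x28x192 -> 28x28x256.
        return torch.cat(branches, dim=1)

x = torch.randn(1, 192, 28, 28)
print(NaiveInceptionModule()(x).shape)  # torch.Size([1, 256, 28, 28])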
Now it turns out that there's a problem with the inception layer as we've described it, which is computational cost. Let's figure out the computational cost of the 5x5 filter that produces the 28x28x32 block. Focusing just on the 5x5 part, we had as input a 28x28x192 block, and you implement a 5x5 same convolution with 32 filters to output 28x28x32. On the previous slide I had drawn this as a thin purple slice, so here I'll draw it as a more normal-looking blue block. You have 32 filters, because the output has 32 channels, and each filter is 5x5x192. Since the output size is 28x28x32, you need to compute 28x28x32 numbers, and for each of them you need to do 5x5x192 multiplications. So the total number of multiplications is the number of multiplications needed per output value times the number of output values, and if you multiply out all these numbers, this comes to about 120 million. While you can do 120 million multiplications on a modern computer, this is still a pretty expensive operation. Next you'll see how, using the idea of 1x1 convolutions that you learned about in the previous video, you can reduce the computational cost by about a factor of 10, from about 120 million multiplications down to about one tenth of that. So please remember the number 120 million so you can compare it with what comes next.

Here's an alternative architecture for inputting 28x28x192 and outputting 28x28x32, which is the following. You take the input volume, use a 1x1 convolution to reduce it to 16 channels instead of 192, and then on this much smaller volume run your 5x5 convolution to give the final output. Notice the input and output dimensions are still the same: you input 28x28x192 and output 28x28x32, just as before. But what we've done is take the huge volume we had on the left and shrink it to a much smaller intermediate volume with only 16 channels instead of 192. Sometimes this is called a bottleneck layer, because a bottleneck is usually the smallest part of something: on a glass bottle, the neck where the cork goes is the narrowest part. In the same way, the bottleneck layer is the smallest part of this network; we shrink the representation before increasing its size again.

Now let's look at the computational cost involved. To apply the 1x1 convolution, we have 16 filters, and each filter has dimension 1x1x192, where this 192 matches the 192 channels of the input. So the cost of computing the 28x28x16 volume is the number of outputs, 28x28x16, times the 192 multiplications (that is, 1x1x192) needed for each of them, which multiplies out to about 2.4 million. That's the cost of the first convolutional layer. The cost of the second convolutional layer is the number of outputs, 28x28x32, times the 5x5x16 multiplications you apply for each output, which comes to about 10 million. So the total number of multiplications you need is the sum of those, about 12.4 million. Comparing this with what we had before, you've reduced the computational cost from about 120 million multiplications down to about one tenth of that, 12.4 million multiplications. The number of additions you need to do is very similar to the number of multiplications, which is why I'm only counting multiplications.

To summarize: if you're building a layer of a neural network and you don't want to have to decide between a 1x1, 3x3, or 5x5 convolution or a pooling layer, the inception module lets you do them all and concatenate the results. We then ran into the problem of computational cost, and what you saw was how using a 1x1 convolution you can create a bottleneck layer, thereby reducing the computational cost significantly.
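As a quick check on the arithmetic above, here is a small snippet (not part of the lecture) that computes the multiplication counts for the 5x5 branch with and without the 1x1 bottleneck; the exact totals are what round to the roughly 120 million and 12.4 million quoted above.

```python
# Direct 5x5 same convolution: 28x28x192 -> 28x28x32
# (number of output values) * (multiplications per output value)
direct = (28 * 28 * 32) * (5 * 5 * 192)
print(f"{direct:,}")                    # 120,422,400  (~120 million)

# With a bottleneck: 1x1 conv down to 16 channels, then 5x5 conv to 32 channels
reduce_1x1 = (28 * 28 * 16) * (1 * 1 * 192)   # ~2.4 million
conv_5x5   = (28 * 28 * 32) * (5 * 5 * 16)    # ~10 million
print(f"{reduce_1x1 + conv_5x5:,}")     # 12,443,648  (~12.4 million)
```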
Now, you might be wondering: does shrinking down the representation size so dramatically hurt the performance of your neural network? It turns out that as long as you implement the bottleneck layer within reason, you can shrink the representation size significantly without it seeming to hurt performance, while saving a lot of computation. So these are the key ideas of the inception module. Let's put them together, and in the next video I'll show you what the full inception network looks like.