Hi, and welcome back. You've learned about the ResNet architecture, and you've learned about InceptionNet. In this video, you'll learn about MobileNet, which is another foundational convolutional neural network architecture used for computer vision. Using MobileNet will allow you to build and deploy neural networks that work even in low-compute environments, such as mobile phones. Let's dive in. Why do you need another neural network architecture? It turns out that a lot of the neural networks you've learned about so far are quite computationally expensive. And if you want your neural network to run on a device with a less powerful CPU or GPU at deployment, then there's another neural network architecture, called MobileNet, that can perform much better. So I hope to share with you in this video how the depth-wise separable convolution works. Let's first revisit what the normal convolution does, and then we'll modify it to build the depth-wise separable convolution. In the normal convolution, you may have an input image that is n by n by nc, where nc is the number of channels, so 6 by 6 by 3 in this case. And you want to convolve it with a filter that is f by f by nc, in this case 3 by 3 by 3. The way you do this is you take this filter, which I'm going to draw as a three-dimensional yellow block, and put the yellow filter over there. There are 27 multiplications you have to do; sum them up, and that gives you this value. Then shift the filter over, multiply the 27 pairs of numbers, add them up, and that gives you this number, and you keep going to get this, this, and this, and so on, until you've computed all 4 by 4 output values. We didn't use padding in this case, and we used a stride of 1, which is why the output size, n out by n out, is a bit smaller than the input size. That is, 4 by 4 instead of 6 by 6.
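The normal convolution walked through above can be sketched with explicit loops. This is a minimal NumPy illustration (the function and variable names are my own, not from the video), using stride 1 and no padding:

```python
import numpy as np

def normal_conv(x, filters):
    # x: (n, n, nc); filters: (nc_prime, f, f, nc)
    n, _, nc = x.shape
    nc_prime, f, _, _ = filters.shape
    n_out = n - f + 1                  # no padding, stride 1
    out = np.zeros((n_out, n_out, nc_prime))
    for k in range(nc_prime):          # one n_out x n_out slice per filter
        for i in range(n_out):
            for j in range(n_out):
                # 27 multiplications per position when f = 3 and nc = 3
                out[i, j, k] = np.sum(x[i:i+f, j:j+f, :] * filters[k])
    return out

x = np.random.rand(6, 6, 3)          # n = 6, nc = 3
w = np.random.rand(5, 3, 3, 3)       # nc_prime = 5, f = 3
print(normal_conv(x, w).shape)       # (4, 4, 5)
```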
And rather than just having one of these 3 by 3 by 3 filters, you may have some nc prime filters, and if you have 5 of them, then the output will be 4 by 4 by 5, or n out by n out by nc prime. So let's figure out the computational cost of what we just did. It turns out the total number of computations needed to compute this output is given by the number of filter parameters, which is 3 by 3 by 3 in this case, multiplied by the number of filter positions, that is, the number of places where we place this big yellow block, which is 4 by 4, and then multiplied by the number of filters, which is 5 in this case. And you can check for yourself that this is the total number of multiplications we need to do, because at each of the locations where we plop down the filter, we need to do this many multiplications, and then we have this many filters. If you multiply these numbers out, this turns out to be 2,160. We'll come back to this; you'll see this number again later in this video, when we come up with the depth-wise separable convolution, which will be able to take as input a 6 by 6 by 3 image and output a 4 by 4 by 5 set of activations, but with fewer computations than 2,160. Let's see how the depth-wise separable convolution does that. In contrast to the normal convolution you just saw, the depth-wise separable convolution has two steps: you first use a depth-wise convolution followed by a point-wise convolution, and it is these two steps which together make up the depth-wise separable convolution. Let's see how each of these two steps works. In particular, let's flesh out the details of how the depth-wise convolution, the first of these two steps, works. As before, we have an input that is 6 by 6 by 3, so n by n by nc, with three channels. The filter in the depth-wise convolution is going to be f by f, not f by f by nc, but just f by f. The number of filters is going to be nc, which in this case is 3.
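The cost count just described reduces to one line of arithmetic; as a quick sanity check (variable names are illustrative only):

```python
# Cost of the normal convolution:
#   filter parameters x filter positions x number of filters
f, nc, n_out, nc_prime = 3, 3, 4, 5
normal_cost = (f * f * nc) * (n_out * n_out) * nc_prime
print(normal_cost)  # 2160
```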
The way you compute the 4 by 4 by 3 output is that you apply each of these filters to the corresponding input channel. Let's step through this. First, let's focus on the first of the three filters, the red one. We're going to take the red filter, position it there, and carry out the nine multiplications. Notice there are only nine numbers you need to multiply, not 27; add them up, and that gives you this value. Then shift it over, multiply the nine corresponding pairs of numbers, and that gives you this value. Over here, it gives you that, and that, and so on, until you get to the last of these 16 values. Next, we go to the second channel, and let's look at the green filter. You position the green filter there, carry out the nine multiplications to compute this value, shift it over by one to compute this value, shift it over by one again, and so on, and you do that 16 times until you've computed all of these values in the second channel of the output. Finally, you do this for the third channel. There's the blue filter: position it there to compute this value, then shift it over, shift it over, shift it over, and so on, until you've computed all 16 outputs in that third channel as well. The size of the output after this step will be n out by n out, that is 4 by 4, by nc, where nc is the same as the number of channels in your original input. Let's look at the computational cost of what we've just done, because generating each of these 4 by 4 by 3 output values requires nine multiplications. So the total computational cost is 3 times 3, multiplied by the number of filter positions, that is, the number of positions each of these filters was placed on top of the image on the left, which was 4 by 4, and then finally times the number of filters, which is 3. Another way to look at this is that you have 4 by 4 by 3 outputs, and for each of those outputs, you needed to carry out nine multiplications.
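The depth-wise step can be sketched the same way: each f by f filter only ever touches its own input channel. A minimal NumPy illustration (names are my own, not from the video):

```python
import numpy as np

def depthwise_conv(x, filters):
    # x: (n, n, nc); filters: (nc, f, f) -- one f x f filter per input channel
    n, _, nc = x.shape
    _, f, _ = filters.shape
    n_out = n - f + 1                  # no padding, stride 1
    out = np.zeros((n_out, n_out, nc))
    for c in range(nc):                # each filter sees only channel c
        for i in range(n_out):
            for j in range(n_out):
                # only 9 multiplications per position when f = 3, not 27
                out[i, j, c] = np.sum(x[i:i+f, j:j+f, c] * filters[c])
    return out

x = np.random.rand(6, 6, 3)          # n = 6, nc = 3
w = np.random.rand(3, 3, 3)          # nc = 3 filters, each 3 x 3
print(depthwise_conv(x, w).shape)    # (4, 4, 3)
```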
So the total computational cost is 3 times 3 times 4 times 4 times 3, and if you multiply out these numbers, it turns out to be 432. But we're not yet done; this is the depth-wise convolution part of the depth-wise separable convolution, and there's one more step. The remaining step is to take this 4 by 4 by 3 set of values, or n out by n out by nc set of values, and apply a point-wise convolution in order to get the output we want, which will be 4 by 4 by 5. So let's see how the point-wise convolution works. Here's the point-wise convolution: we are going to take the intermediate set of values, which is n out by n out by nc, and convolve it with a filter that is 1 by 1 by nc, 1 by 1 by 3 in this case. You take this pink filter, this pink 1 by 1 by 3 block, and apply it at the upper-left-most position, carry out the three multiplications, add them up, and that gives you this value; shift it over by 1, multiply the three pairs of numbers, add them up, and that gives you that value; and so on, and you keep going until you've filled out all 16 values of this output. Now, we've done this with just one filter. In order to get not just a 4 by 4 output but a 4 by 4 by 5 dimensional output, you would actually do this with nc prime filters. In this case, nc prime was set to 5, so with 5 of these 1 by 1 by 3 filters, you end up with a 4 by 4 by 5 output, that is, an n out by n out by nc prime dimensional output. So the point-wise convolution gives you the 4 by 4 by 5 output. Let's figure out the computational cost of what we just did. For every one of these 4 by 4 by 5 output values, we had to apply this pink filter to part of the input, and that costs 3 multiplications, or 1 by 1 by 3, which is the number of filter parameters. The filter had to be placed in 4 by 4 different positions, and we had 5 filters.
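The point-wise step admits the same kind of sketch: each 1 by 1 by nc filter mixes the channels at a single position. A minimal NumPy illustration (names are my own, not from the video):

```python
import numpy as np

def pointwise_conv(x, filters):
    # x: (n_out, n_out, nc); filters: (nc_prime, nc) -- each row is a
    # 1 x 1 x nc filter
    n_out, _, nc = x.shape
    nc_prime = filters.shape[0]
    out = np.zeros((n_out, n_out, nc_prime))
    for k in range(nc_prime):
        for i in range(n_out):
            for j in range(n_out):
                # 3 multiplications per position when nc = 3
                out[i, j, k] = np.sum(x[i, j, :] * filters[k])
    return out

x = np.random.rand(4, 4, 3)          # the intermediate n_out x n_out x nc values
w = np.random.rand(5, 3)             # nc_prime = 5 filters, each 1 x 1 x 3
print(pointwise_conv(x, w).shape)    # (4, 4, 5)
```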
So the total cost of what we just did here is 1 times 1 times 3 times 4 times 4 times 5, which is 240 multiplications. In the example we just walked through, the normal convolution took a 6 by 6 by 3 input and wound up with a 4 by 4 by 5 output, and the same is true for the depth-wise separable convolution, except we did it in two steps, with a depth-wise convolution followed by a point-wise convolution. Now, what were the computational costs of all of these operations? In the case of the normal convolution, we needed 2,160 multiplications to compute the output. For the depth-wise separable convolution, there was first the depth-wise step, where, from earlier in the video, we had 432 multiplications, and then the point-wise step, where we had 240 multiplications; adding these up, we wind up with 672 multiplications. If we look at the ratio between these two numbers, 672 over 2,160, this turns out to be about 0.31. So, in this example, the depth-wise separable convolution was about 31% as computationally expensive as the normal convolution, roughly a 3x savings. The authors of the MobileNet paper showed that, in general, the ratio of the cost of the depth-wise separable convolution to the cost of the normal convolution is equal to 1 over nc prime plus 1 over f squared. In our case, this was 1 over 5 plus 1 over 3 squared, that is, 1 over 5 plus 1 over 9, which comes to about 0.31. In a more typical neural network example, nc prime will be much bigger, so the first term may be, say, 1 over 512, if you have 512 channels in your output, plus 1 over 3 squared; these would be fairly typical parameters in a neural network. The first term is really small, and the second is one ninth. So, very roughly, the depth-wise separable convolution may be about one ninth as expensive, or, rounding up, roughly 10 times cheaper in computational cost.
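Putting the cost counts together confirms both the 672 total and the general ratio stated above (variable names are illustrative only):

```python
# Cost comparison for the worked example: f = 3, nc = 3, n_out = 4, nc_prime = 5
f, nc, n_out, nc_prime = 3, 3, 4, 5

depthwise_cost = (f * f) * (n_out * n_out) * nc          # 432
pointwise_cost = nc * (n_out * n_out) * nc_prime         # 240
normal_cost = (f * f * nc) * (n_out * n_out) * nc_prime  # 2160

ratio = (depthwise_cost + pointwise_cost) / normal_cost
print(round(ratio, 2))                                   # 0.31
# Matches the general formula 1 / nc_prime + 1 / f**2
print(round(1 / nc_prime + 1 / f**2, 2))                 # 0.31
```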
And that's why the depth-wise separable convolution, as a building block of a ConvNet, allows you to carry out inference much more efficiently than using a normal convolution. Now, there's just one more detail I want to share with you before we wrap up this video. In the example we went through, the input here was 6 by 6 by nc, where nc was equal to 3, and thus you had 3 by 3 by nc filters here. Now, the depth-wise separable convolution works for any number of input channels. So if you had 6 input channels, then nc would be equal to 6, and you would then have to have 3 by 3 by 6 filters. And the intermediate output would no longer be 4 by 4 by nc with nc equal to 3; it becomes 4 by 4 by 6. Now, something looks wrong with this diagram, doesn't it? Which is that this should be 3 by 3 by 6, not 3 by 3 by 3. But in order to make the diagrams in the next video look a little bit simpler, even when the number of channels is greater than 3, I'm still going to draw the depth-wise convolution operation as if it were this stack of 3 filters. So when you see this exact icon later, think of it as the icon we're using to denote a depth-wise convolution, rather than a very literal, exact visualization of the number of channels of the depth-wise convolution filter. And I'm going to use a similar icon to denote the point-wise convolution. In this example, the input here would be 4 by 4 by nc, so it's really this value from up here. And rather than expanding this stack of 1 by 1 by nc filters, I'm going to continue to use this pink set of filters that looks like that. And even if nc is much larger, I'm still going to draw it as if it has only 3 filters, to make some of the diagrams look simpler. And this gives you an output of whatever the necessary dimension is, such as 4 by 4 by 8, for some other value of nc prime. So that's it.
You've learned about the depth-wise separable convolution, which comprises two main steps: the depth-wise convolution and the point-wise convolution. This operation can be designed to have the same input and output dimensions as the normal convolution operation, but at a much lower computational cost. Let's now take this building block and use it to build MobileNet. We'll do that in the next video.