You've seen how convolutions over 2D images work. Now, let's see how you can implement convolutions over not just 2D images, but over three-dimensional volumes. Let's start with an example. Let's say you want to detect features not just in a grayscale image, but in an RGB image. So instead of a 6x6 image, an RGB image could be 6x6x3, where the 3 corresponds to the three color channels. So think of this as a stack of three 6x6 images. In order to detect edges or some other feature in this image, you convolve it not with a 3x3 filter as you had previously, but with a 3D filter that's going to be 3x3x3. So the filter itself will also have three layers corresponding to the red, green, and blue channels. To give these things some names: in 6x6x3, the first 6 is the height of the image, the second 6 is the width, and the 3 is the number of channels. Your filter similarly has a height, a width, and a number of channels. And the number of channels in your image must match the number of channels in your filter, so these two numbers have to be equal. We'll see on the next slide how this convolution operation actually works, but the output of this will be a 4x4 image. And notice this is 4x4x1, there's no longer a 3 at the end.
Let's go through in detail how this works, but let's use a more nicely drawn image. So here's the 6x6x3 image, and here's the 3x3x3 filter, and this last number, the number of channels, matches between the image and the filter. To simplify the drawing of this 3x3x3 filter, instead of drawing it as a stack of three matrices, I'm also going to sometimes just draw it as a three-dimensional cube. So to compute the output of this convolution operation, what you would do is take the 3x3x3 filter and first place it in the upper-left position. Notice that this 3x3x3 filter has 27 numbers, 27 parameters, that's 3 cubed.
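
As a rough sketch of the setup described so far, assuming NumPy and a (height, width, channels) array layout, the shapes can be checked like this. The variable names and the random values are illustrative only and do not come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the slide: a 6x6 RGB image and a 3x3x3 filter,
# both stored as (height, width, channels).
image = rng.integers(0, 10, size=(6, 6, 3)).astype(float)
filt = rng.integers(-1, 2, size=(3, 3, 3)).astype(float)

# The number of channels in the image must match the filter's.
assert image.shape[2] == filt.shape[2]

# A 3x3x3 filter holds 3 * 3 * 3 = 27 parameters.
n_params = filt.size

# With stride 1 and no padding, the output is (6 - 3 + 1) x (6 - 3 + 1).
out_h = image.shape[0] - filt.shape[0] + 1
out_w = image.shape[1] - filt.shape[1] + 1

# Placing the filter in the upper-left position covers this 3x3x3 block
# of the image, i.e. 27 image values, one per filter parameter.
patch = image[0:3, 0:3, :]

print(n_params, (out_h, out_w), patch.shape)
```
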
And so what you do is take each of these 27 numbers and multiply them with the corresponding numbers from the red, green, and blue channels of the image. So take the first 9 numbers from the red channel, then the 9 beneath them for the green channel, then the 9 beneath those for the blue channel, and multiply them with the corresponding 27 numbers that are covered by this yellow cube shown on the left. Then add up all those numbers, and this gives you the first number in the output. To compute the next output, you take this cube and slide it over by 1, and again do the 27 multiplications and add up the 27 numbers, and that gives you the next output. Do it for the next position over, and that gives you the third output. The next one gives you the fourth, and then you go one row down, and then the next one, the next one, the next one, and so on. You get the idea, until at the very end, that's the position you would have for the final output.
So what does this allow you to do? Well, here's an example. This filter is 3x3x3, so if you want to detect edges in the red channel of the image, then you could have the first channel of the filter be 1, 1, 1, 0, 0, 0, -1, -1, -1 (a column of 1s, a column of 0s, and a column of -1s), as usual, and have the green channel be all 0s, and have the blue channel be all 0s. If you stack these three together to form your 3x3x3 filter, then this would be a filter that detects vertical edges, but only in the red channel. Alternatively, if you don't care what color the vertical edge is in, then you might have a filter with this same 1, 0, -1 pattern in all three channels. So by choosing this second, alternative set of parameters, you then have a 3x3x3 edge detector that detects edges in any color. And with different choices of these parameters, you can get different feature detectors out of this 3x3x3 filter.
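
The slide-multiply-and-add procedure and the two example filters can be sketched as follows. This is a minimal illustration assuming NumPy, stride 1, no padding, and a (height, width, channels) layout. The function name `conv_volume` and the test images are hypothetical and are not taken from the lecture.

```python
import numpy as np

def conv_volume(image, filt):
    """Slide an (f, f, c) filter over an (n, n, c) image with stride 1 and
    no padding, returning a 2D (n - f + 1, n - f + 1) output."""
    n_h, n_w, n_c = image.shape
    f_h, f_w, f_c = filt.shape
    assert n_c == f_c, "image and filter must have the same number of channels"
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The f x f x c block of the image currently under the filter.
            patch = image[i:i + f_h, j:j + f_w, :]
            # Multiply element by element, then add everything up.
            out[i, j] = np.sum(patch * filt)
    return out

# 3x3 vertical edge pattern: a column of 1s, a column of 0s, a column of -1s.
vertical = np.array([[1, 0, -1]] * 3, dtype=float)

# Filter that looks at the red channel only (green and blue are all zeros).
red_only = np.zeros((3, 3, 3))
red_only[:, :, 0] = vertical

# Filter with the same pattern in all three channels.
any_color = np.stack([vertical] * 3, axis=-1)

# Illustrative image: a vertical edge that exists only in the green channel.
green_edge = np.zeros((6, 6, 3))
green_edge[:, :3, 1] = 10.0

print(conv_volume(green_edge, red_only))   # all zeros: no edge in red
print(conv_volume(green_edge, any_color))  # each row is [0, 30, 30, 0]
```

In this sketch the red-only filter ignores the green edge entirely, while the any-color filter responds to it, which is the distinction the two parameter choices are meant to illustrate.
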
And by convention, in computer vision, when you have an input with a certain height, a certain width, and a certain number of channels, then your filter will have a potentially different height and a different width, but the same number of channels. And in theory, it's possible to have a filter that maybe only looks at the red channel, or maybe a filter that looks at only the green channel and the blue channel. And once again, notice that convolving a volume, a 6x6x3 convolved with a 3x3x3, gives a 4x4, a 2D output.
Now that you know how to convolve on volumes, there's one last idea that will be crucial for building convolutional neural networks, which is, what if we don't just want to detect vertical edges? What if we want to detect vertical edges and horizontal edges, and maybe 45 degree edges, and maybe 70 degree edges as well? In other words, what if you want to use multiple filters at the same time? So, here's the picture we had from the previous slide: 6x6x3 convolved with 3x3x3 gives 4x4, and maybe this is a vertical edge detector, or maybe it's learned to detect some other feature. Now, maybe a second filter, denoted by this orange-ish color, could be a horizontal edge detector. So convolving the image with the first filter gives you this first 4x4 output, and convolving it with the second filter gives you a different 4x4 output. What you can do is then take these two 4x4 outputs, take the first one and put it in front, and take the second filter's output and, well, let me draw it here, put it in the back as follows. So by stacking these two together, you end up with a 4x4x2 output volume. And you can think of the volume as, if you draw this as a box, I guess it would maybe look like this. So this would be a 4x4x2 output volume, which is the result of taking your 6x6x3 image and applying two different 3x3x3 filters to it, resulting in two 4x4 outputs that then get stacked up to form a 4x4x2 volume.
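
Stacking the outputs of several filters can be sketched like this, again assuming NumPy, stride 1, and no padding. The name `conv_multi` and the choice to store the filters as an (f, f, channels, number of filters) array are assumptions made for this illustration, not something specified in the lecture.

```python
import numpy as np

def conv_multi(image, filters):
    """Apply several (f, f, c) filters to an (n, n, c) image.

    `filters` has shape (f, f, c, n_filters); the result has shape
    (n - f + 1, n - f + 1, n_filters), one 2D output per filter.
    """
    n_h, n_w, n_c = image.shape
    f_h, f_w, f_c, n_filters = filters.shape
    assert n_c == f_c, "image and filters must have the same number of channels"
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1, n_filters))
    for k in range(n_filters):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = image[i:i + f_h, j:j + f_w, :]
                out[i, j, k] = np.sum(patch * filters[:, :, :, k])
    return out

# Vertical and horizontal edge patterns, repeated across all three channels.
vertical = np.array([[1, 0, -1]] * 3, dtype=float)
horizontal = vertical.T
vertical_3d = np.stack([vertical] * 3, axis=-1)      # shape (3, 3, 3)
horizontal_3d = np.stack([horizontal] * 3, axis=-1)  # shape (3, 3, 3)
filters = np.stack([vertical_3d, horizontal_3d], axis=-1)  # shape (3, 3, 3, 2)

# Illustrative image: bright left half in every channel (a vertical edge).
image = np.zeros((6, 6, 3))
image[:, :3, :] = 10.0

volume = conv_multi(image, filters)
print(volume.shape)     # (4, 4, 2)
print(volume[:, :, 0])  # vertical-edge response, each row is [0, 90, 90, 0]
print(volume[:, :, 1])  # horizontal-edge response, all zeros for this image
```
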
And the 2 here comes from the fact that we used two different filters.
So let's just summarize the dimensions. If you have an n x n x n_C input image, where n_C is the number of channels (in the example this is 6x6x3), and you convolve that with an f x f x n_C filter (in the example 3x3x3; by convention the two n_C values have to be the same number), then what you get is an (n - f + 1) x (n - f + 1) x n_C' output, where n_C', or really the n_C of the next layer, is the number of filters that you use. So this in our example would be 4x4x2. I wrote these assuming that you use a stride of 1 and no padding, but if you use a different stride or padding, then this n - f + 1 would be affected in the usual way, as we saw in the previous videos.
So this idea of convolution on volumes turns out to be really powerful. Only a small part of it is that you can now operate directly on RGB images with three channels. Even more important is that you can now detect two features, like vertical and horizontal edges, or 10, or maybe 128, or maybe several hundred different features, and the output will then have a number of channels equal to the number of features you are detecting.
Oh, and as a note on notation, I've been using number of channels to denote this last dimension. In the literature, people will also often call this the depth of this 3D volume. Both terms, channels and depth, are commonly used in the literature, but I find depth more confusing because we usually talk about the depth of a neural network as well. So I'm going to use the term channels in these videos to refer to the size of this third dimension of these filters. So now that you know how to implement convolutions over volumes, you are ready to implement one layer of a convolutional neural network. Let's see how to do that in the next video.
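
The dimension summary can be written as a small helper. This is a sketch only: the function name and argument names are hypothetical, and the stride and padding adjustment assumes the floor((n + 2p - f) / s) + 1 rule from the earlier padding and stride videos.

```python
def conv_output_shape(n, f, n_filters, stride=1, pad=0):
    """Output shape for an n x n x n_C input convolved with n_filters
    filters of size f x f x n_C (n_C cancels out of the result)."""
    # Assumed rule from the padding/stride videos:
    # floor((n + 2 * pad - f) / stride) + 1
    side = (n + 2 * pad - f) // stride + 1
    return (side, side, n_filters)

# The lecture's example: 6x6x3 image, two 3x3x3 filters, stride 1, no padding.
print(conv_output_shape(6, 3, 2))                   # (4, 4, 2)
# Illustrative variations with padding and stride.
print(conv_output_shape(6, 3, 2, pad=1))            # (6, 6, 2)
print(conv_output_shape(7, 3, 128, stride=2))       # (3, 3, 128)
```

Note that the number of input channels does not appear in the output shape: the last dimension depends only on how many filters are applied.
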
