You're now ready to see how to build one layer of a convolutional neural network. Let's go through an example. You've seen in a previous video how to take a 3D volume and convolve it with, say, two different filters in order to get, in this example, two different 4x4 outputs. So let's say convolving with the first filter gives the first 4x4 output, and convolving with the second filter gives a different 4x4 output. The final thing to turn this into a convolutional neural network layer is that for each of these, we add a bias. The bias is a real number, and with Python broadcasting you add the same number to every one of the 16 elements, and then apply a non-linearity, which for illustration is a ReLU. That gives you a 4x4 output after applying the bias and the non-linearity. Then for the second output you add a different bias, again a real number, so you add that same real number to all 16 numbers, apply a non-linearity, say a ReLU, and this gives you a different 4x4 output. Then, same as before, you stack these up and end up with a 4x4x2 output. This computation, where you've gone from a 6x6x3 volume to a 4x4x2 volume, is one layer of a convolutional neural network.

To map this back to one step of forward propagation in a standard, non-convolutional neural network, remember that one step of forward prop looked like this: z[1] = W[1] a[0] + b[1], where a[0] is also equal to x, and then you apply the non-linearity to get a[1] = g(z[1]). In this analogy, the input volume is a[0], really x, and the filters play a role similar to W[1]. During the convolution operation you take the 27 numbers in each filter, really 27 times 2 because you have two filters, multiply them by the corresponding numbers in the input volume, and sum, so you're really computing a linear function to produce each 4x4 output. The two 4x4 outputs of the convolution operation together play a role similar to W[1] a[0]. Then you add the bias, so the result just before applying the ReLU plays a role similar to z[1]. Finally, by applying the non-linearity, this output becomes your activation at the next layer. So that's how you go from a[0] to a[1]: the convolution applies a linear operation, you add the biases, you apply a ReLU, and you've gone from a 6x6x3-dimensional a[0], through one layer of a neural network, to a 4x4x2-dimensional a[1]. So 6x6x3 has gone to 4x4x2, and that's one layer of a convolutional net.

Now, in this example we had two filters, two features if you will, which is why we wound up with a 4x4x2 output. If, for example, we instead had ten filters, then we would have wound up with a 4x4x10-dimensional output volume, because we'd be stacking up ten of these feature maps, not just two, to form a 4x4x10 output volume, and that's what a[1] would be.
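To make the computation concrete, here is a minimal numpy sketch of one convolutional layer's forward pass as just described: convolve the 6x6x3 input with each 3x3x3 filter, add a real-number bias per filter, apply a ReLU, and stack the results into a 4x4x2 volume. The function name and the explicit loops are just for illustration, not the course's official implementation.

```python
import numpy as np

def conv_layer_forward(a_prev, W, b, stride=1):
    """One conv layer forward pass (valid convolution, no padding).

    a_prev: input volume, shape (n_H_prev, n_W_prev, n_C_prev), e.g. (6, 6, 3)
    W:      filters, shape (f, f, n_C_prev, n_C), e.g. (3, 3, 3, 2)
    b:      one real-number bias per filter, shape (1, 1, n_C)
    Returns the activations, shape (n_H, n_W, n_C), e.g. (4, 4, 2).
    """
    f, _, _, n_C = W.shape
    n_H_prev, n_W_prev, _ = a_prev.shape
    n_H = (n_H_prev - f) // stride + 1
    n_W = (n_W_prev - f) // stride + 1

    z = np.zeros((n_H, n_W, n_C))
    for h in range(n_H):                      # loop over output height
        for w in range(n_W):                  # loop over output width
            patch = a_prev[h*stride:h*stride+f, w*stride:w*stride+f, :]
            for c in range(n_C):              # one linear function per filter, plus bias
                z[h, w, c] = np.sum(patch * W[:, :, :, c]) + float(b[0, 0, c])

    return np.maximum(0, z)                   # ReLU non-linearity

# 6x6x3 input, two 3x3x3 filters -> 4x4x2 output
a0 = np.random.randn(6, 6, 3)
W1 = np.random.randn(3, 3, 3, 2)
b1 = np.random.randn(1, 1, 2)
a1 = conv_layer_forward(a0, W1, b1)
print(a1.shape)   # (4, 4, 2)
```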
So to make sure you understand this, let's go through an exercise. Suppose you have ten filters, not just two, that are 3x3x3, in one layer of a neural network. How many parameters does this layer have? Well, let's figure this out. Each filter is a 3x3x3 volume, so each filter has 27 parameters, that is, 27 numbers to be learned, plus the bias, the b parameter, which gives you 28 parameters. On the previous slide we had drawn two filters, but if you imagine you actually have ten of these, then all together you would have 28 times 10, that is, 280 parameters. Notice one nice thing about this: no matter how big the input image is, whether it's 1,000x1,000 or 5,000x5,000, the number of parameters stays fixed at 280, and you can use these ten filters to detect features, vertical edges, horizontal edges, maybe other features, anywhere in even a very large image with just a very small number of parameters. This is one property of convolutional neural nets that makes them less prone to overfitting: once you've learned ten feature detectors that work, you can apply them even to very large images, and the number of parameters remains fixed and relatively small, 280 in this example.
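As a quick check of that arithmetic, here is a short Python snippet; the variable names are just illustrative.

```python
# Ten 3x3x3 filters, each with one real-number bias term.
f, n_c_prev, n_filters = 3, 3, 10

params_per_filter = f * f * n_c_prev + 1      # 27 weights + 1 bias = 28
total_params = params_per_filter * n_filters  # 28 * 10
print(total_params)                           # 280

# Nothing above depends on the input image size: whether the image is
# 6x6x3 or 1000x1000x3, this layer still has exactly 280 parameters.
```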
Alright, so to wrap up this video, let's summarize the notation we're going to use to describe one convolutional layer of a convolutional neural network. If layer l is a convolutional layer, I'm going to use f^[l] to denote the filter size. Previously we've been saying the filters are f x f, and the superscript square bracket l just denotes that this is an f x f filter in layer l; as usual, the superscript square bracket l refers to a particular layer l. I'm going to use p^[l] to denote the amount of padding. The amount of padding can also be specified just by saying you want a valid convolution, which means no padding, or a same convolution, which means you choose the padding so that the output has the same height and width as the input. And I'm going to use s^[l] to denote the stride.

Now, the input to this layer is going to be some volume: some n x n times the number of channels, where these all come from the previous layer, so I'll attach the superscript [l-1], because that's the activation from the previous layer, giving n_C^[l-1] channels. In the examples so far we've been using images with the same height and width, but since the height and width might differ, I'm going to use subscripts H and W to denote the height and width of the input from the previous layer. So in general the volume in layer l has size n_H^[l] x n_W^[l] x n_C^[l], and the input to this layer is whatever you had from the previous layer, which is why it's n_H^[l-1] x n_W^[l-1] x n_C^[l-1]. This layer of the neural network will then itself output a volume of size n_H^[l] x n_W^[l] x n_C^[l].

Previously we said that the output height and width are given by the formula (n + 2p - f)/s + 1, and then you take the floor of that, rounding down. In this new notation, the height of the output volume in layer l is the dimension from the previous layer, plus twice the padding used in layer l, minus the filter size used in layer l, divided by the stride of layer l, plus one, rounded down: n_H^[l] = floor((n_H^[l-1] + 2 p^[l] - f^[l]) / s^[l] + 1). The same is true for the width: cross out H and put in W, and the same formula with either the height or the width plugged in computes the height or the width of the output volume. So that's how n_H^[l-1] relates to n_H^[l], and n_W^[l-1] relates to n_W^[l].

Now, how about the number of channels, where does that number come from? Let's take a look. We know from the previous examples that the depth of the output volume equals the number of filters in that layer: we had two filters, and the output volume was 4x4x2, so its depth was two; with ten filters, the output volume was 4x4x10. So the number of channels in the output volume, n_C^[l], is just the number of filters we're using in this layer of the neural network.

Next, how about the size of each filter? Each filter is going to be f^[l] x f^[l] x one other number. What is this last number? Well, we saw that you convolve a 6x6x3 image with a 3x3x3 filter, so the number of channels in the filter must match the number of channels in the input. That's why each filter is f^[l] x f^[l] x n_C^[l-1].

The output of this layer, after applying the biases and the non-linearity, is the activation of this layer, a^[l], and we've already seen that this is a 3D volume of dimension n_H^[l] x n_W^[l] x n_C^[l]. When you use a vectorized implementation with batch or mini-batch gradient descent, you actually output A^[l], which is a set of m activations if you have m examples, so its dimension is m x n_H^[l] x n_W^[l] x n_C^[l]. In the programming exercises, this will be the ordering of the variables: the index over training examples first, and then these three dimensions.

Next, how about the weights, the W parameters? We saw already that one filter has dimension f^[l] x f^[l] x n_C^[l-1], but that's the dimension of one filter. How many filters do we have? n_C^[l] is the total number of filters, so the weights, really all of the filters put together, have dimension f^[l] x f^[l] x n_C^[l-1] x n_C^[l]. Finally, you have the bias parameters: one bias parameter, one real number, for each filter, so the bias is just a vector of dimension n_C^[l]. Later on, we'll see that in the code it will be more convenient to represent it as a 1 x 1 x 1 x n_C^[l] four-dimensional tensor.

So I know that was a lot of notation, and this is the convention I'll use for the most part. I just want to mention that if you search online and look at open source code, there isn't a completely universal standard convention about the ordering of the height, width, and channel dimensions.
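Here is a small sketch of those shapes, using made-up sizes (the mini-batch of 32 examples is just an illustrative choice) and the m x n_H x n_W x n_C ordering described above.

```python
import numpy as np

# Illustrative sizes for one conv layer l.
m = 32                                      # mini-batch size (arbitrary)
n_H_prev, n_W_prev, n_C_prev = 6, 6, 3      # input volume from layer l-1
f, s, p = 3, 1, 0                           # filter size, stride, padding
n_C = 10                                    # number of filters in layer l

# Output height and width from the formula floor((n + 2p - f)/s + 1).
n_H = (n_H_prev + 2 * p - f) // s + 1       # 4
n_W = (n_W_prev + 2 * p - f) // s + 1       # 4

A_prev = np.zeros((m, n_H_prev, n_W_prev, n_C_prev))   # activations of layer l-1
W = np.zeros((f, f, n_C_prev, n_C))                    # all filters stacked together
b = np.zeros((1, 1, 1, n_C))                           # one bias per filter
A = np.zeros((m, n_H, n_W, n_C))                       # activations of layer l

for name, arr in [("A_prev", A_prev), ("W", W), ("b", b), ("A", A)]:
    print(name, arr.shape)
# A_prev (32, 6, 6, 3)   W (3, 3, 3, 10)   b (1, 1, 1, 10)   A (32, 4, 4, 10)
```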
So if you look at source code on GitHub or read some of the open source implementations, you'll find that some authors use the other ordering instead, where the channel dimension comes first, and you'll sometimes see that ordering of the variables. In fact, in multiple programming frameworks there's a variable or parameter that lets you choose whether to index these volumes with the number of channels first or with the number of channels last. I think both conventions work okay, so long as you're consistent; unfortunately, this is maybe one piece of notation where there isn't consensus in the deep learning literature. But I'm going to use this convention for these videos, where we list the height, then the width, and then the number of channels last, as shown in the short sketch at the end of this section.

So I know that was suddenly a lot of new notation to introduce, and you might be thinking, wow, this is a lot of notation, do I need to remember all of it? Don't worry about it; you don't need to memorize all of this notation, and through this week's exercises you'll become more familiar with it. The key point I hope you take away from this video is how one layer of a convolutional neural network works, and the computations involved in taking the activations of one layer and mapping them to the activations of the next layer. Next, now that you know how one layer of a convolutional neural network works, let's stack a bunch of these together to form a deeper convolutional neural network. Let's go on to the next video to see how that works.
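As a small postscript on the channel-ordering point above, here is a tiny numpy illustration of the two conventions, with arbitrary sizes; the underlying data is the same either way, and moving between the orderings is just a transpose.

```python
import numpy as np

m, n_H, n_W, n_C = 32, 4, 4, 10

A_channels_last = np.zeros((m, n_H, n_W, n_C))    # ordering used in these videos
A_channels_first = np.zeros((m, n_C, n_H, n_W))   # ordering some frameworks/authors use

# Converting between the two conventions is just a transpose of the axes.
print(A_channels_last.transpose(0, 3, 1, 2).shape == A_channels_first.shape)  # True
```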