In the last video, you learned about the sliding windows object detection algorithm using a ConvNet, but we saw that it was too slow. In this video, you'll learn how to implement that algorithm convolutionally. Let's see what this means.

To build up toward the convolutional implementation of sliding windows, let's first see how you can turn the fully connected layers in your neural network into convolutional layers. We'll do that first on this slide, and on the next slide we'll use these ideas to show you the convolutional implementation. Let's say that your object detection algorithm inputs 14x14x3 images. This is quite small, but just for illustrative purposes. Let's say it then uses 5x5 filters, and let's say it uses 16 of them to map the input from 14x14x3 to 10x10x16. It then does a 2x2 max pooling to reduce that to 5x5x16, has a fully connected layer with 400 units, then another fully connected layer, and finally outputs y using a softmax unit. In order to make the change we'll need in a second, I'm going to change this picture a little bit: instead, I'm going to view y as four numbers corresponding to the class probabilities of the four classes the softmax unit is classifying amongst. The four classes could be pedestrian, car, motorcycle, and background, or something else.

Now what I'd like to do is show how these layers can be turned into convolutional layers. I'm going to draw the ConvNet the same as before for the first few layers. Now, one way of implementing the next layer, the fully connected layer, is as a 5x5 filter; let's use 400 of these 5x5 filters. If you take a 5x5x16 volume and convolve it with a 5x5 filter, remember that a 5x5 filter is implemented as 5x5x16, because our convention is that the filter looks across all 16 channels. So this 16 and this 16 must match, and the output will be 1x1. And if you have 400 of these 5x5x16 filters, then the output dimension is going to be 1x1x400. So rather than viewing these 400 units as just a set of nodes, we're going to view this as a 1x1x400 volume. Mathematically, this is the same as a fully connected layer, because each of these 400 nodes has a filter of dimension 5x5x16, and so each of those 400 values is some arbitrary linear function of the 5x5x16 activations from the previous layer.

Next, to implement the following layer, we're going to use a 1x1 convolution. If you have 400 1x1 filters, then the next layer will again be 1x1x400, which gives you the equivalent of the next fully connected layer. Finally, we have another 1x1 filter followed by a softmax activation, so as to give a 1x1x4 volume to take the place of the four numbers the network was outputting. So this shows how you can take fully connected layers and implement them using convolutional layers: these sets of units are now implemented as 1x1x400 and 1x1x4 volumes.

Armed with this conversion, let's see how you can have a convolutional implementation of sliding windows object detection. The presentation on this slide is based on the OverFeat paper, referenced at the bottom, by Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Let's say that your sliding windows ConvNet inputs 14x14x3 images. Again, I'm using small numbers like the 14x14 image on this slide mainly to make the numbers and illustrations simpler.
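To make the fully-connected-to-convolutional equivalence concrete, here is a minimal sketch in PyTorch (the lecture itself uses no code, so the framework and every name below are illustrative assumptions): the weight matrix of a fully connected layer over a 5x5x16 volume can be reshaped into 400 filters of shape 16x5x5, and the two layers then produce identical outputs.

```python
import torch
import torch.nn as nn

# Fully connected layer: flattens a 5x5x16 volume into 400 units.
fc = nn.Linear(5 * 5 * 16, 400)

# Equivalent convolutional layer: 400 filters, each of shape 16x5x5,
# so one application over a 5x5x16 input yields a 1x1x400 volume.
conv = nn.Conv2d(16, 400, kernel_size=5)

# Copy the FC weights into the conv filters: each row of the FC weight
# matrix reshapes into one 16x5x5 filter.
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(400, 16, 5, 5))
    conv.bias.copy_(fc.bias)

x = torch.randn(1, 16, 5, 5)       # a 5x5x16 activation volume
out_fc = fc(x.flatten(1))          # shape (1, 400)
out_conv = conv(x).flatten(1)      # 1x1x400 volume, flattened to (1, 400)
print(torch.allclose(out_fc, out_conv, atol=1e-6))  # True
```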
So as before, you have a neural network that eventually outputs a 1x1x4 volume, which is the output of your softmax unit. To simplify the drawing: the 14x14x3 input is technically a volume, and so are the 10x10x16 and 5x5x16 layers, but for this slide I'm just going to draw the front face of these volumes. So instead of drawing a 1x1x400 volume, I'm just going to draw the 1x1 part; we'll drop the 3D component of these drawings just for this slide.

So let's say that your ConvNet inputs 14x14x3 images, and your test set image is 16x16x3. I've now added that yellow stripe to the border of this image. In the original sliding windows algorithm, you might input the blue region into your ConvNet and run it once to generate a classification label, 0 or 1. Then, using a stride of 2 pixels, you might slide the window over to the right, input this green rectangle into the ConvNet, and rerun the whole ConvNet to get another label. Then you might input the orange region and run it one more time to get another label, and do it a fourth and final time with the lower right, now purple, square. So to run sliding windows on this 16x16x3 image, small as it is, you run the ConvNet from above four times in order to get four labels.

But it turns out a lot of the computation done by these four ConvNets is highly duplicated. What a convolutional implementation of sliding windows does is allow these four forward passes of the ConvNet to share a lot of computation. Specifically, here's what you can do. You can take the ConvNet and just run it on the full 16x16x3 image with the same parameters, the same 16 5x5 filters, and you get a 12x12x16 output volume. Then do the max pool, same as before, which gives you 6x6x16. Run that through your same 400 5x5 filters to get a 2x2x400 volume; so now, instead of a 1x1x400 volume, you have a 2x2x400 volume. Run it through your 1x1 filters, which gives you another 2x2x400 instead of 1x1x400. Do that one more time and you're left with a 2x2x4 output volume instead of 1x1x4.

It turns out that the blue 1x1x4 subset of this output gives you the result of running the ConvNet on the upper left 14x14 region of the image. The upper right 1x1x4 volume gives you the upper right result, the lower left gives you the result of running the ConvNet on the lower left 14x14 region, and the lower right 1x1x4 volume gives you the same result as running the ConvNet on the lower right 14x14 region. If you step through all the steps of the calculation, take the green region as an example: had you cropped out just that region and passed it through the ConvNet on top, the first layer's activations would have been exactly this region, the next layer's activations after max pooling would have been exactly this region, and so on through the following layers.

So what this convolutional implementation does is, instead of forcing you to run forward propagation on four subsets of the input image independently, it combines all four into one forward propagation and shares a lot of the computation in the regions of the image that are common to all four of the 14x14 patches we saw here.
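Here is a sketch of the whole network from the slide with its fully connected layers converted to convolutions, again in PyTorch with randomly initialized weights (the ReLU activations are assumptions, since the slide doesn't specify them). Running the same network on a 14x14 input gives a 1x1x4 output, and on the 16x16 test image a 2x2x4 output, one 4-class prediction per window position, in a single forward pass.

```python
import torch
import torch.nn as nn

# Sketch of the lecture's architecture with FC layers as convolutions.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),     # 14x14x3 -> 10x10x16
    nn.MaxPool2d(2),                     # 10x10x16 -> 5x5x16
    nn.Conv2d(16, 400, kernel_size=5),   # 5x5x16 -> 1x1x400 (was FC, 400 units)
    nn.ReLU(),
    nn.Conv2d(400, 400, kernel_size=1),  # 1x1x400 -> 1x1x400 (was second FC)
    nn.ReLU(),
    nn.Conv2d(400, 4, kernel_size=1),    # 1x1x400 -> 1x1x4 (4 class scores)
)

# One 14x14 crop yields a single 4-class prediction...
print(net(torch.randn(1, 3, 14, 14)).shape)  # torch.Size([1, 4, 1, 1])

# ...while the full 16x16 test image yields a 2x2 grid of predictions,
# one per 14x14 window at stride 2, in one forward pass.
print(net(torch.randn(1, 3, 16, 16)).shape)  # torch.Size([1, 4, 2, 2])
```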
Now let's go through a bigger example. Let's say you now want to run sliding windows on a 28x28x3 image. It turns out that if you run forward prop the same way, you end up with an 8x8x4 output, and this corresponds to running sliding windows with a 14x14 window: first on the upper left region, giving you the output in the upper left corner, then using a stride of two to shift the window over one position at a time, and so on; there are eight positions across, which gives you the first row, and then as you go down the image as well, that gives you all of the 8x8x4 outputs. And it's because of the max pooling of two that this corresponds to running your neural network with a stride of two on the original image.

So just to recap: to implement sliding windows previously, you would crop out a region, let's say 14x14, run it through your ConvNet, then do that for the next region over, then the next 14x14 region, then the next one, and so on, until hopefully one of them recognizes the car. But now, with the convolutional implementation you saw on the previous slide, instead of doing it sequentially, you can input the entire image, maybe 28x28, and convolutionally make all the predictions at the same time with one forward pass through this big ConvNet, and hopefully have it recognize the position of the car.

So that's how you implement sliding windows convolutionally, and it makes the whole thing much more efficient. This algorithm still has one weakness, though: the positions of the bounding boxes are not going to be too accurate. In the next video, let's see how you can fix that problem.
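Continuing the sketch above (same assumed network and framework), a 28x28 input produces the 8x8 grid of window predictions described here:

```python
# Reusing `net` from the sketch above: a 28x28x3 image yields an 8x8 grid
# of predictions, one per 14x14 window at stride 2. The stride of 2 comes
# from the single 2x2 max pooling layer.
out = net(torch.randn(1, 3, 28, 28))
print(out.shape)  # torch.Size([1, 4, 8, 8])

# Softmax over the channel dimension recovers the 4 class probabilities
# for each of the 64 window positions.
probs = torch.softmax(out, dim=1)
print(probs.shape)  # torch.Size([1, 4, 8, 8])
```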