In the last video, you learned how to use a convolutional implementation of sliding windows. That's more computationally efficient, but it still has the problem of not quite outputting the most accurate bounding boxes. In this video, let's see how you can get your bounding box predictions to be more accurate. With sliding windows, you take this discrete set of locations and run the classifier through it. And in this case, none of the boxes really matches up perfectly with the position of the car. So maybe that box was the best match. And also, it looks like in the ground truth, the perfect bounding box isn't even quite square. It's actually a slightly wider rectangle, with a slightly horizontal aspect ratio. So is there a way to get this algorithm to output more accurate bounding boxes? A good way to get this algorithm to output more accurate bounding boxes is with the YOLO algorithm. YOLO stands for You Only Look Once, and it's an algorithm due to Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. Here's what you do. Let's say you have an input image that's 100 by 100. You're going to place down a grid on this image. And for the purposes of illustration, I'm going to use a 3 by 3 grid, although in an actual implementation, you'd use a finer one, like maybe a 19 by 19 grid. And the basic idea is you're going to take the image classification and localization algorithm that you saw in the first video of this week and apply it to each of the nine grid cells of this image. So to be more concrete, here's how you define the labels you use for training. For each of the nine grid cells, you specify a label Y, where the label Y is this eight-dimensional vector, same as you saw previously. You'll first output PC, which is 0 or 1 depending on whether or not there's an object in that grid cell, and then BX, BY, BH, BW to specify the bounding box if there is an object associated with that grid cell.
And then say C1, C2, C3, if you're trying to recognize three classes, not counting the background class. So if you're trying to recognize pedestrians, cars, motorcycles, and a background class, then C1, C2, C3 can be the pedestrian, car, and motorcycle classes. So in this image, we have nine grid cells, and you have a vector like this for each of the grid cells. Let's start with the upper left grid cell, this one up here. For that one, there is no object. So the label vector Y for the upper left grid cell would have PC equal to zero, and then don't cares for the rest of the components. And the output label Y would be the same for this grid cell and this grid cell, and all the grid cells with nothing, with no interesting object in them. Now, how about this grid cell? To give a bit more detail, this image has two objects, and what the YOLO algorithm does is it takes the midpoint of each of the two objects, and it assigns each object to the grid cell containing its midpoint. So the left car is assigned to this grid cell, and the car on the right, which has this midpoint, is assigned to this grid cell. And so even though the central grid cell has some parts of both cars, we'll pretend the central grid cell has no interesting objects. So for the central grid cell, the class label Y also looks like this vector with no object, so PC equals zero as the first component, and then the rest are don't cares. Whereas for this cell, the cell that I've circled in green on the left, the target label Y would be as follows: there is an object, and then you write BX, BY, BH, BW to specify the position of this bounding box. And then, let's see, if class 1 is a pedestrian, that's 0; class 2 is a car, so that's 1; class 3 is a motorcycle, so that's 0. And similarly, for the grid cell on the right, because that does have an object in it, it will also have some vector like this as the target label corresponding to that grid cell.
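To make the label vector concrete, here's a minimal sketch in Python (the helper name and dict fields are my own for illustration, not from the lecture) of building the 8-dimensional target Y = [PC, BX, BY, BH, BW, C1, C2, C3] for a single grid cell:

```python
import numpy as np

def make_cell_label(obj=None):
    """Build the 8-dimensional target vector y = [pc, bx, by, bh, bw, c1, c2, c3]
    for one grid cell. `obj` is None when no object midpoint falls in the cell;
    otherwise it is a dict with the box parameters and a class index
    (0 = pedestrian, 1 = car, 2 = motorcycle). The "don't care" entries are
    simply left as zeros here, since the loss ignores them anyway."""
    y = np.zeros(8)
    if obj is not None:
        y[0] = 1.0                                           # pc = 1: object present
        y[1:5] = [obj["bx"], obj["by"], obj["bh"], obj["bw"]]  # bounding box
        y[5 + obj["cls"]] = 1.0                              # one-hot class c1, c2, c3
    return y

# The green-circled cell containing a car (class index 1):
y_car = make_cell_label({"bx": 0.4, "by": 0.3, "bh": 0.5, "bw": 0.9, "cls": 1})
# An empty cell: pc = 0, and the rest are don't cares.
y_empty = make_cell_label()
```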
So for each of these nine grid cells, you end up with an 8-dimensional output vector. And because you have a 3 by 3 grid of cells, you have nine grid cells, so the total volume of the output is going to be 3 by 3 by 8. So the target output is going to be 3 by 3 by 8, because you have 3 by 3 grid cells, and for each of the 3 by 3 grid cells, you have an 8-dimensional Y vector. So the target output volume is 3 by 3 by 8, where, for example, this 1 by 1 by 8 volume in the upper left corresponds to the target output vector for the upper left of the nine grid cells. And so for each of the 3 by 3 positions, for each of these nine grid cells, there's a corresponding 8-dimensional target vector Y that you want in the output, some of which could be don't cares if there's no object there. And that's why the total target output, the output label for this image, is now itself a 3 by 3 by 8 volume. So now, to train your neural network, the input is 100 by 100 by 3, that's the input image, and then you have a usual conv net with conv layers, max pool layers, and so on. You should choose the conv layers and the max pool layers so that this eventually maps to a 3 by 3 by 8 output volume. And so what you do is you have an input X, which is the input image, and you have these target labels Y, which are 3 by 3 by 8, and you use back propagation to train the neural network to map from any input X to this type of output volume Y. So the advantage of this algorithm is that the neural network outputs precise bounding boxes, as follows.
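As a sketch of how the full 3 by 3 by 8 target volume could be assembled, here is an illustrative helper (my own names and conventions, not the paper's code), assuming each object is given as a midpoint and size in whole-image fractions plus a class index:

```python
import numpy as np

def make_target(objects, grid=3):
    """Assemble the grid x grid x 8 target volume y, where each cell's
    8 channels are [pc, bx, by, bh, bw, c1, c2, c3]. Each object is a tuple
    (mid_x, mid_y, box_h, box_w, cls) with midpoint and size given as
    fractions of the whole image; the object is assigned to the single
    cell containing its midpoint."""
    y = np.zeros((grid, grid, 8))
    for mid_x, mid_y, box_h, box_w, cls in objects:
        col = min(int(mid_x * grid), grid - 1)   # cell containing the midpoint
        row = min(int(mid_y * grid), grid - 1)
        y[row, col, 0] = 1.0                     # pc = 1 for this cell
        y[row, col, 1] = mid_x * grid - col      # bx: midpoint offset within cell
        y[row, col, 2] = mid_y * grid - row      # by
        y[row, col, 3] = box_h * grid            # bh: height relative to cell
        y[row, col, 4] = box_w * grid            # bw: width relative to cell
        y[row, col, 5 + cls] = 1.0               # one-hot class, e.g. 1 = car
    return y

# Two cars (class index 1) roughly matching the slide: one with its midpoint
# in the left-middle cell, one in the right-middle cell. Every other cell
# keeps pc = 0 with don't cares (zeros) for the rest.
target = make_target([(0.15, 0.5, 0.17, 0.2, 1), (0.8, 0.43, 0.17, 0.3, 1)])
```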
So at test time, what you do is feed in an input image X and run forward prop until you get this output Y. And then for each of the nine outputs, for each of the 3 by 3 positions in the output, you can just read off one or zero: is there an object associated with that one of the nine positions, and if there is an object, what object it is, and what is the bounding box for the object in that grid cell. And so long as you don't have more than one object in each grid cell, this algorithm should work okay. The problem of having multiple objects within a grid cell is something we'll address later. And while I've used a relatively small 3 by 3 grid here, in practice you might use a much finer grid, maybe 19 by 19, so you end up with 19 by 19 by 8. That makes your grid much finer and reduces the chance that there are multiple objects assigned to the same grid cell. And just as a reminder, the way you assign an object to a grid cell is you look at the midpoint of the object, and then you assign that object to whichever one grid cell contains the midpoint of the object. So each object, even if it spans multiple grid cells, is assigned only to one of the nine grid cells, or one of the 3 by 3, or one of the 19 by 19 grid cells. And with a 19 by 19 grid, the chance of two objects' midpoints appearing in the same grid cell is just a bit smaller. So notice two things. First, this is a lot like the image classification and localization algorithm that we talked about in the first video of this week, in that it outputs the bounding box coordinates explicitly. And so this allows the neural network to output bounding boxes of any aspect ratio, as well as output much more precise coordinates that aren't just dictated by the stride size of your sliding windows classifier. And second, this is a convolutional implementation.
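The test-time read-off described above might look like this as a sketch, assuming the 3 by 3 by 8 encoding from this video (the function and names are my own, not the paper's):

```python
import numpy as np

CLASSES = ["pedestrian", "car", "motorcycle"]   # the three classes from the lecture

def decode_predictions(y, threshold=0.5):
    """Read detections off a grid x grid x 8 output volume: wherever the pc
    channel exceeds the threshold, report the most likely class and the box
    midpoint/size converted back to whole-image fractions."""
    grid = y.shape[0]
    detections = []
    for row in range(grid):
        for col in range(grid):
            pc, bx, by, bh, bw = y[row, col, :5]
            if pc > threshold:
                cls = CLASSES[int(np.argmax(y[row, col, 5:]))]
                mid_x = (col + bx) / grid        # cell-relative -> image-relative
                mid_y = (row + by) / grid
                detections.append((cls, mid_x, mid_y, bh / grid, bw / grid))
    return detections

# One car in the middle-right cell of a 3 x 3 grid:
y = np.zeros((3, 3, 8))
y[1, 2] = [1.0, 0.4, 0.3, 0.5, 0.9, 0.0, 1.0, 0.0]
detections = decode_predictions(y)
```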
You're not implementing this algorithm nine times on a 3 by 3 grid, or, if you're using a 19 by 19 grid, 19 squared is 361, so you're not running the same algorithm 361 times. Instead, this is one single convolutional implementation, where you use one conv net with a lot of shared computation between all the computations needed for all of your 3 by 3, or all of your 19 by 19, grid cells. So this is a pretty efficient algorithm. And in fact, one nice thing about the YOLO algorithm, which accounts for its popularity, is that because this is a convolutional implementation, it actually runs very fast. So this works even for real-time object detection. Now, before wrapping up, there's one more detail I want to share with you, which is how you encode these bounding boxes BX, BY, BH, BW. Let's discuss that on the next slide. So given these two cars, remember we have the 3 by 3 grid. Let's take the example of the car on the right. In this grid cell, there is an object, and so the target label Y will have PC equal to 1, and then BX, BY, BH, BW, and then 0, 1, 0 for the classes. So how do you specify the bounding box? In the YOLO algorithm, relative to this square, we're going to take the convention that the upper left point here is (0, 0) and this lower right point is (1, 1). So to specify the position of that midpoint, that orange dot: it looks like it's maybe about 0.4 of the way to the right, so BX is about 0.4, and BY looks like maybe 0.3. And then the width and height of the bounding box are specified as fractions of the overall width and height of the grid cell. So the width of this red box is maybe 90 percent of that blue line, and so BW is 0.9, and the height of the bounding box is maybe one half of the overall height of the grid cell, so BH would be 0.5.
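A small sketch of that encoding convention (a hypothetical helper, with numbers chosen to roughly match the example): given a box as a midpoint and size in whole-image fractions, plus the cell containing its midpoint, compute BX, BY, BH, BW relative to that cell:

```python
def encode_box(mid_x, mid_y, box_h, box_w, row, col, grid=3):
    """Encode a bounding box, given as midpoint and size in fractions of the
    whole image, relative to grid cell (row, col): the cell's upper left
    corner is (0, 0) and its lower right corner is (1, 1). bh and bw become
    fractions of the cell's height and width, and may exceed 1."""
    bx = mid_x * grid - col      # midpoint offset within the cell
    by = mid_y * grid - row
    bh = box_h * grid            # height as a fraction of the cell height
    bw = box_w * grid            # width as a fraction of the cell width
    return bx, by, bh, bw

# Roughly the right-hand car from the slide: its midpoint falls in the
# middle-right cell (row 1, col 2), about 0.4 across and 0.3 down within it.
bx, by, bh, bw = encode_box(0.8, 0.433, 1 / 6, 0.3, row=1, col=2)
```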
So in other words, BX, BY, BH, BW are specified relative to the grid cell. And BX and BY have to be between 0 and 1, because pretty much by definition that orange dot is within the bounds of the grid cell it's assigned to; if it was outside the square, it would have been assigned to a different grid cell. But BH and BW could be greater than 1. In particular, if you had a car where the bounding box looked like that, then the height and width of the bounding box could be greater than 1. So there are multiple ways of specifying the bounding boxes, but this is one convention that's quite reasonable. Although, if you read the YOLO research papers, there are other parametrizations that work even a little bit better, involving, for example, sigmoid functions to make sure BX and BY are between 0 and 1, and an exponential parametrization to make sure that BH and BW are non-negative, since values like the 0.9 and 0.5 here have to be greater than or equal to 0. But the convention you saw here should work okay. So that's it for the YOLO, or You Only Look Once, algorithm. In the next few videos, I'll show you a few other ideas that will help make this algorithm even better. In the meantime, if you want, you can take a look at the YOLO paper referenced at the bottom of these past couple of slides. Although, just one warning if you take a look at these papers: the YOLO paper is one of the harder papers to read. I remember when I was reading this paper for the first time, I had a really hard time figuring out what was going on, and I wound up asking a couple of my friends, very good researchers, to help me figure it out, and even they had a hard time understanding some of the details of the paper.
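A minimal sketch of that more advanced parametrization (the variable names tx, ty, th, tw are my own; this mirrors the sigmoid/exponential idea just mentioned, not the papers' exact formulation):

```python
import math

def decode_box(tx, ty, th, tw):
    """Map unconstrained network outputs to constrained box parameters:
    a sigmoid keeps bx and by in (0, 1), and an exponential keeps bh and bw
    non-negative (they can still exceed 1, which is what you want for boxes
    bigger than one grid cell)."""
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    return sigmoid(tx), sigmoid(ty), math.exp(th), math.exp(tw)
```

Whatever raw values the network produces, the decoded BX and BY stay inside the cell and BH and BW stay non-negative, so the constraints are built into the parametrization rather than learned.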
So if you look at the paper, it's okay if you have a hard time figuring it out. I wish it were less common, but it's not that uncommon, sadly, for even senior researchers to read research papers and have a hard time figuring out the details, and to have to look at open source code, or contact the authors, or something else to figure out the details of these algorithms. But don't let me stop you from taking a look at the paper yourself if you wish; just know that this is one of the harder ones. So with that, you now understand the basics of the YOLO algorithm. Let's go on to some additional pieces that will make this algorithm work even better.