One of the problems with object detection, as you've seen it so far, is that each of the grid cells can detect only one object. What if a grid cell wants to detect multiple objects? Here's what you can do. You can use the idea of anchor boxes. Let's start with an example. Let's say you have an image like this, and for this example I'm going to continue to use a 3x3 grid. Notice that the midpoint of the pedestrian and the midpoint of the car are in almost the same place, and both of them fall into the same grid cell. So for that grid cell, if y outputs this vector, where you are detecting three classes, pedestrians, cars, and motorcycles, it won't be able to output two detections, so you would have to pick one of the two detections to output. With the idea of anchor boxes, what you're going to do is predefine two different shapes called anchor boxes or anchor box shapes. You can then associate two predictions with the two anchor boxes. In general you might use more anchor boxes, maybe five or even more, but for this video I'm just going to use two anchor boxes to make the description easier. So what you do is define the class label so that, instead of this vector on the left, you basically repeat it twice. So you would have PC, BX, BY, BH, BW, C1, C2, C3, and these are the eight outputs associated with anchor box 1. And then you repeat that, PC, BX, and so on, down to C1, C2, C3, and those are the eight outputs associated with anchor box 2. So because the shape of the pedestrian is more similar to the shape of anchor box 1 than anchor box 2, you can use these first eight numbers to encode that PC is 1, yes, there is a pedestrian. Use this to encode the bounding box around the pedestrian, and then use these to encode that the object is a pedestrian. 
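As a rough illustration of the repeated label vector described above, here is a minimal NumPy sketch. The field names and the choice to store the 16 numbers as one flat array are assumptions made for the example, not something the lecture prescribes.

```python
import numpy as np

# Field order inside one anchor slot, following the order read out in the lecture.
FIELDS = ["pc", "bx", "by", "bh", "bw", "c1", "c2", "c3"]
NUM_ANCHORS = 2  # the lecture uses two anchor boxes; more are possible

# One grid cell's label: 8 numbers for anchor box 1, then 8 for anchor box 2.
y_cell = np.zeros(NUM_ANCHORS * len(FIELDS))

# The same 16 numbers viewed as (anchor, field), so each anchor's slot is one row.
y_by_anchor = y_cell.reshape(NUM_ANCHORS, len(FIELDS))

# Writing through the view: mark "object present" in anchor box 2's slot.
y_by_anchor[1, FIELDS.index("pc")] = 1.0
```

Because `reshape` returns a view of the contiguous array here, setting the `pc` field of the second row also sets element 8 of the flat 16-number vector.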
And then because the box around the car is more similar to the shape of anchor box 2 than anchor box 1, you can then use these to encode that the second object here is the car, and have the bounding box and so on be all the parameters associated with the detected car. So to summarize, previously, before you were using anchor boxes, you did the following: each object in a training set image was assigned to the grid cell that contains that object's midpoint. And so the output Y was 3 by 3 by 8, because you have the 3 by 3 grid and for each grid position we had that output vector, which is PC, then the bounding box, then C1, C2, C3. With anchor boxes, you now do the following. Each object is assigned to the same grid cell as before, the grid cell that contains that object's midpoint, but it is also assigned to the anchor box for that grid cell with the highest IOU with the object's shape. So if you have two anchor boxes and an object with this shape, what you do is take your two anchor boxes, maybe one anchor box is this shape, that's anchor box 1, maybe anchor box 2 is this shape, and then you see which of the two anchor boxes has a higher IOU with the ground truth bounding box. Whichever it is, that object then gets assigned not just to a grid cell, but to a pair. It gets assigned to a (grid cell, anchor box) pair. And that's how that object gets encoded in the target label. And so now, the output Y is going to be 3 by 3 by 16, because as you saw on the previous slide, Y is now 16-dimensional. Or if you want, you can also view this as 3 by 3 by 2 by 8, because there are now two anchor boxes and each one has an 8-dimensional output. Oh, and that dimension being 8 was because we have three object classes. If you have more object classes, then the dimension of Y will be even higher. So let's go through a concrete example. For this grid cell, let's specify what Y is. 
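The anchor assignment rule can be sketched in a few lines of Python. This sketch assumes that "IOU with the object's shape" means comparing heights and widths only, as if both boxes shared the same center; the anchor shapes and object sizes below are made-up numbers for illustration.

```python
def shape_iou(box_hw, anchor_hw):
    """IoU of two boxes compared by shape only (assumed to share a center)."""
    h, w = box_hw
    ah, aw = anchor_hw
    inter = min(h, ah) * min(w, aw)
    union = h * w + ah * aw - inter
    return inter / union


def best_anchor(box_hw, anchors_hw):
    """Index of the anchor box whose shape has the highest IoU with the object."""
    ious = [shape_iou(box_hw, a) for a in anchors_hw]
    return max(range(len(ious)), key=ious.__getitem__)


# Hypothetical anchor shapes as (height, width): one tall and skinny, one wide.
ANCHORS = [(2.0, 0.6), (0.8, 2.0)]

pedestrian_hw = (1.8, 0.5)  # hypothetical ground-truth shape
car_hw = (0.9, 2.2)         # hypothetical ground-truth shape

pedestrian_anchor = best_anchor(pedestrian_hw, ANCHORS)  # index 0, anchor box 1
car_anchor = best_anchor(car_hw, ANCHORS)                # index 1, anchor box 2
```

Each object is then stored under the pair (grid cell containing its midpoint, index returned by `best_anchor`).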
So the pedestrian is more similar to the shape of anchor box 1. So for the pedestrian, we're going to assign it to the top half of this vector. So, yes, there is an object, there will be some bounding box associated with the pedestrian, and I guess if a pedestrian is class 1, then C1 is 1 and then C2 and C3 are 0, 0. And then the shape of the car is more similar to anchor box 2, and so in the rest of this vector PC will be 1, then comes the bounding box associated with the car, and then the car is C2, so that's 0, 1, 0. And so that's the label Y for that lower middle grid cell that this arrow is pointing to. Now, what if this grid cell only had a car and had no pedestrian? If it only had a car, then assuming that the shape of the bounding box around the car is still more similar to anchor box 2, the target label Y, if there was just a car there and the pedestrian had gone away, would still be the same for the anchor box 2 component. Remember that this is the part of the vector corresponding to anchor box 2. For the part of the vector corresponding to anchor box 1, you just say there is no object there, so PC is 0, and then the rest of these will be don't cares. Now, just some additional details. What if you have two anchor boxes but three objects in the same grid cell? That's one case that this algorithm doesn't handle well. Hopefully, it won't happen, but if it does, this algorithm doesn't have a great way of handling it. I would just implement some default tiebreaker for that case. Or what if you have two objects associated with the same grid cell, but both of them have the same anchor box shape? Again, that's another case that this algorithm doesn't handle well. You can implement some default way of tiebreaking if that happens, but hopefully it won't happen in your data set much at all, and so it shouldn't affect performance much. So, that's it for anchor boxes. 
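The concrete label for that lower middle grid cell can be sketched as follows. This is a minimal sketch under several assumptions: the bounding box numbers are invented, "don't care" entries are simply left at zero, and the tiebreaker for a slot that is already taken is "first object wins", which is just one possible default.

```python
import numpy as np

GRID, NUM_ANCHORS, NUM_CLASSES = 3, 2, 3
SLOT = 5 + NUM_CLASSES  # pc, bx, by, bh, bw, c1, c2, c3


def encode(objects):
    """Build the target Y from (row, col, anchor, (bx, by, bh, bw), class_index) tuples."""
    y = np.zeros((GRID, GRID, NUM_ANCHORS, SLOT))
    for row, col, anchor, box, cls in objects:
        if y[row, col, anchor, 0] == 1.0:
            continue  # slot already used: "first object wins" tiebreak (an assumption)
        y[row, col, anchor, 0] = 1.0        # pc: an object is present
        y[row, col, anchor, 1:5] = box      # bx, by, bh, bw
        y[row, col, anchor, 5 + cls] = 1.0  # one-hot class
    return y


# Hypothetical boxes; the lower middle grid cell is row 2, column 1.
pedestrian = (2, 1, 0, (0.5, 0.6, 1.8, 0.5), 0)  # anchor box 1, class 1
car = (2, 1, 1, (0.5, 0.6, 0.9, 2.2), 1)         # anchor box 2, class 2

y_both = encode([pedestrian, car])
y_car_only = encode([car])  # anchor box 1 slot keeps pc = 0

y_flat = y_both.reshape(GRID, GRID, NUM_ANCHORS * SLOT)  # the 3 by 3 by 16 view
```

In `y_car_only`, the anchor box 2 slot is identical to the one in `y_both`, while the anchor box 1 slot has PC equal to 0.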
And even though I've motivated anchor boxes as a way to deal with what happens if two objects appear in the same grid cell, in practice that happens quite rarely, especially if you use a 19x19 rather than a 3x3 grid. You know, the chance of two objects having their midpoints in the same one of these 361 cells, it does happen, but it doesn't happen that often. Maybe the even better motivation, the even better result that anchor boxes give you, is that they allow your learning algorithm to specialize better. In particular, if your data set has some tall, skinny objects like pedestrians and some wide objects like cars, then this allows your learning algorithm to specialize so that some of the output units can specialize in detecting wide, fat objects like cars and some of the output units can specialize in detecting tall, skinny objects like pedestrians. So, finally, how do you choose the anchor boxes? People used to just choose them by hand, choosing maybe five or ten anchor box shapes that span a variety of shapes and seem to cover the types of objects you expect to detect. As a much more advanced version, just an advanced comment for those of you that have other knowledge in machine learning, an even better way to do this, from one of the later YOLO research papers, is to use a k-means algorithm to group together the types of object shapes you tend to get, and to use that to select a set of anchor boxes that is most stereotypically representative of the maybe multiple, maybe dozens of object classes you're trying to detect. But that's a more advanced way to automatically choose the anchor boxes. And if you just choose by hand a variety of shapes that reasonably spans the set of object shapes you expect to detect, some tall, skinny ones, some fat, wide ones, that should work reasonably well. So that's it for anchor boxes. In the next video, let's take everything we've seen and tie it back together into the YOLO algorithm.
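For the k-means idea mentioned above, here is a small NumPy sketch. The lecture only says that k-means is used to group object shapes; using shape-only IoU as the similarity measure, updating centroids with the mean, and starting from hand-picked boxes are all assumptions of this sketch, and the box sizes are toy numbers.

```python
import numpy as np


def pairwise_shape_iou(boxes, centroids):
    """Shape-only IoU between boxes (N, 2) and centroids (K, 2); columns are (h, w)."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0])
             * np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centroids[:, 0] * centroids[:, 1])[None, :]
    return inter / (area_b + area_c - inter)


def kmeans_anchors(boxes, init_idx, iters=50):
    """Cluster box shapes; init_idx picks the boxes used as starting centroids."""
    centroids = boxes[init_idx].astype(float)
    for _ in range(iters):
        # Assign each box to the centroid it overlaps most (highest IoU).
        assign = np.argmax(pairwise_shape_iou(boxes, centroids), axis=1)
        # Move each centroid to the mean shape of its boxes; keep it if it has none.
        new = np.array([
            boxes[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids


# Toy ground-truth shapes (height, width): three tall boxes, then three wide ones.
boxes = np.array([[1.8, 0.5], [2.0, 0.6], [1.9, 0.55],
                  [0.9, 2.2], [0.8, 2.0], [1.0, 2.4]])

anchors = kmeans_anchors(boxes, init_idx=[0, 3])  # one tall and one wide starting shape
```

On this toy data the two resulting anchors are the mean tall shape and the mean wide shape. On real data you would typically run this over all ground-truth boxes in the training set and try several values of k.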
