Hello, and welcome back. This week, you'll learn about object detection. This is one of the areas of computer vision that's just exploding, and it's working so much better than just a couple of years ago. In order to build up to object detection, you first learn about object localization. Let's start by defining what that means. You're already familiar with the image classification task, where an algorithm looks at this picture and might be responsible for saying, this is a car. So that was classification. The problem you learn to build a neural network to address later in this video is classification with localization, which means not only do you have to label this as, say, a car, but the algorithm is also responsible for putting a bounding box, or drawing a red rectangle, around the position of the car in the image. So that's called the classification with localization problem, where the term localization refers to figuring out where in the picture the car you've detected is. Later this week, you then learn about the detection problem, where now there might be multiple objects in the picture, and you have to detect them all and localize them all. And if you're doing this for an autonomous driving application, then you might need to detect not just other cars, but maybe also pedestrians and motorcycles and maybe even other objects. You'll see that later this week. So in the terminology we'll use this week, the classification and the classification with localization problems usually have one object, usually one big object in the middle of the image, that you're trying to recognize, or recognize and localize. In contrast, in the detection problem there can be multiple objects, and in fact maybe even multiple objects of different categories within a single image. So the ideas you've learned about for image classification will be useful for classification with localization, and the ideas you learn for localization will then turn out to be useful for detection.

So let's start by talking about classification with localization. You're already familiar with the image classification problem, in which you might input a picture into a conv net with multiple layers. So there's our conv net, and this results in a vector of features that is fed to, maybe, a softmax unit that outputs the predicted class. So if you're building a self-driving car, maybe your object categories are the following: you might have a pedestrian, or a car, or a motorcycle, or background, which means none of the above. So if there's no pedestrian, no car, and no motorcycle, then you might output background. So if these are your classes, then you have a softmax with four possible outputs. This is the standard classification pipeline. How about if you want to localize the car in the image as well? To do that, you can change your neural network to have a few more output units that output a bounding box. In particular, you can have the neural network output four more numbers, which I'm going to call BX, BY, BH, and BW. These four numbers parameterize the bounding box of the detected object. In these videos, I'm going to use the notational convention that the upper left of the image is the coordinate (0, 0), and the lower right is (1, 1). So specifying the bounding box, the red rectangle, requires specifying its midpoint, that's the point (BX, BY), as well as its height, BH, and its width, BW.
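To make this concrete, here is a minimal sketch of what such a network could look like, assuming TensorFlow/Keras. The input size, the layers, and their sizes are illustrative assumptions, not the architecture used in the lecture; the point is simply that the final layers produce both class probabilities and the four bounding box numbers.

```python
# A minimal sketch (assumed architecture) of a conv net whose output is extended
# to include both class probabilities and the bounding box numbers bx, by, bh, bw.
import tensorflow as tf

inputs = tf.keras.Input(shape=(64, 64, 3))                     # input image; size is an assumption
x = tf.keras.layers.Conv2D(16, 3, activation="relu")(inputs)   # illustrative conv layer
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)            # vector of features

# Four class probabilities: pedestrian, car, motorcycle, background.
class_probs = tf.keras.layers.Dense(4, activation="softmax", name="class_probs")(x)
# Four bounding box numbers bx, by, bh, bw, each between 0 and 1.
bbox = tf.keras.layers.Dense(4, activation="sigmoid", name="bbox")(x)

model = tf.keras.Model(inputs, [class_probs, bbox])
model.summary()
```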
So now, if your training set contains not just the object class label, which your neural network is trying to predict up here, but also four additional numbers giving the bounding box, then you can use supervised learning to make your algorithm output not just the class label, but also the four parameters telling you where the bounding box of the object you detected is. So in this example, the ideal BX might be about 0.5, because the car is about halfway to the right of the image. BY might be about 0.7, since the car is about 70% of the way down the image. BH might be about 0.3, because the height of this red rectangle is about 30% of the overall height of the image, and BW might be about 0.4, let's say, because the width of the red box is about 0.4 of the overall width of the entire image. So let's formalize this a bit more in terms of how we define the target label Y for this as a supervised learning task. Just as a reminder, these are our four classes, and the neural network now outputs those four numbers as well as a class label, or maybe probabilities of the class labels. So let's define the target label Y as follows. It's going to be a vector with eight components, [PC, BX, BY, BH, BW, C1, C2, C3], where the first component, PC, indicates: is there an object? So if the object is class 1, 2, or 3, PC will be equal to 1, and if it's the background class, so none of the objects you're trying to detect, then PC will be 0. You can think of PC as standing for the probability that there's an object, the probability that one of the classes you're trying to detect is there, so something other than the background class. Next, if there is an object, then you want it to output BX, BY, BH, and BW, the bounding box for the object you detected. And finally, if there is an object, so if PC is equal to 1, you want it to also output C1, C2, and C3, which tell you whether it's class 1, class 2, or class 3, that is, whether it's a pedestrian, a car, or a motorcycle. And remember, in the problem we're addressing, we assume that your image has only one object. So at most one of these objects appears in the picture in this classification with localization problem. So let's go through a couple of examples. If this is a training set image, so if that is X, then the first component of Y, PC, will be equal to 1, because there is an object, and BX, BY, BH, and BW will specify the bounding box, so your labeled training set will need bounding boxes in the labels. And then finally, this is a car, so it's class 2: C1 will be 0, because it's not a pedestrian, C2 will be 1, because it is a car, and C3 will be 0, since it's not a motorcycle. So among C1, C2, and C3, at most one of them should be equal to 1. So that's if there is an object in the image. What if there's no object in the image? What if you have a training example where X is equal to that? In this case, PC will be equal to 0, and the rest of the elements will be "don't cares," so I'm going to write question marks in all of them. Because if there is no object in this image, then you don't care what bounding box the neural network outputs, or which of the three classes, C1, C2, C3, it thinks it is. So given a set of labeled training examples, this is how you construct X, the input image, as well as Y, the target label, both for images where there is an object and for images where there is no object. The set of these will then define your training set. Finally, let's describe the loss function you use to train the neural network.
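As a concrete illustration, here is one way the eight-component target Y = [PC, BX, BY, BH, BW, C1, C2, C3] could be assembled in NumPy. The make_target helper and its argument names are hypothetical, and the "don't care" entries are simply filled with zeros here, since the loss ignores them anyway.

```python
import numpy as np

# Hypothetical helper that builds the 8-component target Y = [PC, BX, BY, BH, BW, C1, C2, C3].
def make_target(object_class=None, bbox=None):
    """object_class: 1 (pedestrian), 2 (car), 3 (motorcycle), or None for background.
    bbox: (bx, by, bh, bw) relative to the image, or None for background."""
    y = np.zeros(8)
    if object_class is None:
        return y               # PC = 0; the remaining entries are "don't cares" (zeros here)
    y[0] = 1.0                 # PC = 1: an object is present
    y[1:5] = bbox              # BX, BY, BH, BW
    y[4 + object_class] = 1.0  # one-hot class indicator among C1, C2, C3
    return y

# The car example from the lecture, using the approximate numbers above.
print(make_target(object_class=2, bbox=(0.5, 0.7, 0.3, 0.4)))
# -> [1.  0.5 0.7 0.3 0.4 0.  1.  0. ]
print(make_target())           # a background image: PC = 0, everything else ignored
```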
So the ground truth label is Y, and the neural network outputs some Y hat. What should the loss be? Well, if you're using squared error, then the loss can be (Y1 hat minus Y1) squared, plus (Y2 hat minus Y2) squared, plus dot, dot, dot, plus (Y8 hat minus Y8) squared. Notice that Y here has eight components, so the loss is the sum of the squares of the differences over all the elements. And that's the loss if Y1 is equal to 1, so that's the case where there is an object. Y1 is equal to PC, right? So if PC is equal to 1, that is, if there is an object in the image, then the loss can be the sum of squares over all the different elements. The other case is if Y1 is equal to 0, so if PC is equal to 0. In that case, the loss can be just (Y1 hat minus Y1) squared, because in that second case, all the rest of the components are "don't cares," and all you care about is how accurately the neural network outputs PC. So just to recap: if Y1 is equal to 1, that's this case, then you can use squared error to penalize the squared deviation between the predicted and the actual outputs for all eight components. Whereas if Y1 is equal to 0, then the second through eighth components are "don't cares," so all you care about is how accurately your neural network is estimating Y1, which is equal to PC. And just as a side comment for those of you who want to know all the details, I've used squared error just to simplify the description here. In practice, you could probably use a log-likelihood loss for C1, C2, C3 with a softmax output over those elements, use squared error or something like squared error for the bounding box coordinates, and then for PC, you could use something like the logistic regression loss. Although, even if you use squared error for everything, it'll probably work okay. So that's how you get a neural network to not just classify an object, but also to localize it. The idea of having a neural network output a bunch of real numbers to tell you where things are in the picture turns out to be a very powerful idea. In the next video, I want to share with you some other places where this idea of having a neural network output a set of real numbers, almost as a regression task, can be used very powerfully elsewhere in computer vision as well. So let's go on to the next video.
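Here is a small sketch of that simplified squared-error loss in NumPy, assuming Y and Y hat are stored as eight-component vectors in the order [PC, BX, BY, BH, BW, C1, C2, C3]; the function name and example numbers are illustrative, not from the course materials.

```python
import numpy as np

# Sketch of the simplified squared-error loss described above.
def localization_loss(y_hat, y):
    if y[0] == 1:
        # Object present (PC = 1): squared error over all eight components.
        return np.sum((y_hat - y) ** 2)
    # No object (PC = 0): only the PC prediction matters; the rest are "don't cares".
    return (y_hat[0] - y[0]) ** 2

y_true = np.array([1, 0.5, 0.7, 0.3, 0.4, 0, 1, 0], dtype=float)   # the car example above
y_pred = np.array([0.9, 0.48, 0.72, 0.25, 0.45, 0.05, 0.9, 0.05])
print(localization_loss(y_pred, y_true))        # sum of all eight squared differences

y_background = np.zeros(8)                      # PC = 0, everything else is a "don't care"
print(localization_loss(y_pred, y_background))  # only (0.9 - 0)^2 = 0.81
```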