Deep learning has been successfully applied to computer vision, natural language processing, speech recognition, online advertising, logistics, and many other problems. There are a few things that are unique about the application of deep learning to computer vision, about the status of computer vision. In this video, I'm going to share with you some of my observations about deep learning for computer vision, in the hope that they will help you better navigate the literature and the set of ideas out there, and how you build these systems yourself.

You can think of most machine learning problems as falling somewhere on a spectrum between having relatively little data and having lots of data. For example, I think that today we have a decent amount of data for speech recognition, at least relative to the complexity of the problem. And even though there are reasonably large datasets today for image recognition or image classification, because image recognition is just a complicated problem, to look at all those pixels and figure out what it is, it feels like, even though the online datasets are quite big, with over a million images, we still wish we had more data. And there are some problems, like object detection, where we have even less data. Just as a reminder, image recognition is the problem of looking at a picture and telling you, is this a cat or not? Object detection is looking at a picture and actually putting bounding boxes on it, telling you where in the picture the objects, such as cars, are as well. Because it's more expensive to label both the objects and the bounding boxes, we tend to have less data for object detection than for image recognition. Object detection is something we'll discuss next week.

If you look across a broad spectrum of machine learning problems, you see on average that when you have a lot of data, people tend to get away with using simpler algorithms as well as less hand engineering. There's less need to carefully design features for the problem; instead you can have a giant neural network, even a simpler architecture, and have the network just learn whatever it wants to learn when you have a lot of data. In contrast, when you don't have that much data, on average you see people engaging in more hand engineering, and if you want to be ungenerous, you can say there are more hacks. But I think when you don't have much data, hand engineering is actually the best way to get good performance.

When I look at machine learning applications, I think the learning algorithm usually has two sources of knowledge. One source of knowledge is the labeled data, the (x, y) pairs you use for supervised learning. The second source of knowledge is hand engineering, and there are lots of ways to hand engineer a system, from carefully hand designing the features, to carefully hand designing the network architecture, to other components of your system. So when you don't have much labeled data, you just have to count more on hand engineering. Computer vision is trying to learn a really complex function, and it often feels like we don't have enough data for it. Even though datasets are getting bigger and bigger, often we just don't have as much data as we need.
And this is why the state of computer vision, historically and even today, has relied more on hand engineering. I think this is also why the field of computer vision has developed rather complex network architectures: in the absence of more data, the way to get good performance is to spend more time architecting, or fiddling around with, the network architecture. And in case you think I'm being derogatory about hand engineering, that's not at all my intent. When you don't have enough data, hand engineering is a very difficult, very skillful task that requires a lot of insight. Someone who is insightful at hand engineering will get better performance, and it's a great contribution to a project to do that hand engineering when you don't have enough data. It's just that when you have lots of data, I wouldn't spend time hand engineering; I would spend time building up the learning system instead.

Historically, the field of computer vision has used very small datasets, and so the computer vision literature has relied on a lot of hand engineering. Even though in the last few years the amount of data we throw at computer vision tasks has increased dramatically, and that has resulted in a significant reduction in the amount of hand engineering being done, there's still a lot of hand engineering of network architectures in computer vision. That's why you see very complicated hyperparameter choices in computer vision, more complex than in a lot of other disciplines. In fact, because you usually have smaller object detection datasets than image recognition datasets, when we talk about object detection next week, you'll see that the algorithms become even more complex and have even more specialized components. Fortunately, one thing that helps a lot when you have little data is transfer learning. And I would say, for the example from the previous slide, the Tigger, Misty, neither detection problem, you have so little data that transfer learning will help a lot. So that's another set of techniques that's used a lot when you have relatively little data.

If you look at the computer vision literature and the set of ideas out there, you also find that people are really enthusiastic about doing well on standardized benchmark datasets and on winning competitions. For computer vision researchers, if you do well on a benchmark, it's easier to get a paper published, so there's just a lot of attention on doing well on these benchmarks. The positive side of this is that it helps the whole community figure out what the most effective algorithms are. But you also see in the papers people do things that allow you to do well on a benchmark but that you wouldn't really use in a production system that you deploy in an actual application.

So here are a few tips for doing well on benchmarks. These are things that I myself pretty much never use if I'm putting a system into production, that is, to actually serve customers. One is ensembling. What that means is: after you've figured out what neural network you want, train several neural networks independently and average their outputs. So initialize, say, three or five or seven neural networks randomly, train all of them up, and then average their outputs, as in the sketch below.
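As a rough illustration, here is what ensembling at test time looks like in code. This is a minimal sketch, assuming `models` is a list of already trained Keras-style classifiers whose `predict` method returns class probabilities; the names here are hypothetical, not from the lecture:

```python
import numpy as np

def ensemble_predict(models, images):
    """Average the class probabilities (the y-hats) of several
    independently trained classifiers. Note: average the outputs,
    not the weights; averaging weights of separately trained
    networks does not work."""
    probs = [m.predict(images) for m in models]  # each: (batch, num_classes)
    return np.mean(probs, axis=0)

# Usage sketch: pick the highest-probability class per test image.
# labels = np.argmax(ensemble_predict(models, test_images), axis=1)
```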
And by the way, it's important to average their outputs, the ŷ values. Don't average their weights; that won't work. Take your, say, seven neural networks that make seven different predictions, and average those predictions. This will cause you to do a little bit better on some benchmark, maybe as much as 1% or 2%, which you really hope will win a competition. But because ensembling means that to test on each image you might need to run it through anywhere from, say, 3 to 15 different networks, which is quite typical, it slows down your running time by a factor of 3 to 15, or sometimes even more. So ensembling is one of those tips that people use for doing well on benchmarks and for winning competitions, but that I think is almost never used in production to serve actual customers, unless you have a huge computational budget and don't mind burning a lot more of it per customer image.

Another thing you see in papers that really helps on benchmarks is multi-crop at test time. You've seen how you can do data augmentation, and multi-crop is a form of applying data augmentation to your test image as well. For example, let's say you have a cat image, and you copy it four times, including two mirrored versions. There's a technique called the 10-crop, which basically says: take the central region, that crop, and run it through your classifier; then take the upper left-hand corner crop and run it through your classifier; then the upper right-hand corner, the lower left, and the lower right; and then do the same thing with the mirrored image. So you take the central crop and the four corner crops of both the original and the mirrored image; add those up and that's 10 different crops of the image, hence the name 10-crop. What you do is run these 10 images through your classifier and then average the results (see the sketch after this paragraph). If you have the computational budget, you could do this. Maybe you don't need as many as 10 crops; you can use a few crops, and this might get you a little bit better performance in a production system. By production, I mean a system you're deploying for actual users. But this is another technique that is used much more for doing well on benchmarks than in actual production systems.

One of the big problems of ensembling is that you need to keep all these different networks around, and that just takes up a lot more computer memory. With multi-crop, at least you keep just one network around, so it doesn't use up as much memory, but it still slows down your run time quite a bit. So these are tips you see, and research papers will refer to these tips as well, but I personally do not tend to use these methods when building production systems, even though they are great for doing better on benchmarks and for winning competitions.
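Here is a minimal sketch of 10-crop evaluation, assuming images are numpy arrays in (height, width, channels) layout and that `model.predict` returns class probabilities; the interface is hypothetical:

```python
import numpy as np

def ten_crop_predict(model, image, crop_h, crop_w):
    """Average a classifier's predictions over the 10 standard crops:
    the center and 4 corners of the image and of its horizontal mirror."""
    h, w, _ = image.shape
    crops = []
    for img in (image, image[:, ::-1, :]):       # original, then mirrored
        offsets = [((h - crop_h) // 2, (w - crop_w) // 2),  # center
                   (0, 0),                                  # upper left
                   (0, w - crop_w),                         # upper right
                   (h - crop_h, 0),                         # lower left
                   (h - crop_h, w - crop_w)]                # lower right
        for top, left in offsets:
            crops.append(img[top:top + crop_h, left:left + crop_w, :])
    probs = model.predict(np.stack(crops))       # shape: (10, num_classes)
    return probs.mean(axis=0)                    # averaged class probabilities
```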
Because a lot of computer vision problems are in the small data regime, others have done a lot of hand engineering of the network architectures, and a neural network that works well on one vision problem often, maybe surprisingly, will work well on other vision problems too. So to build a practical system, you often do well starting off with someone else's neural network architecture, and you can use an open source implementation if possible, because the open source implementation might have figured out all the finicky details, like the learning rate decay schedule and other hyperparameters. And finally, someone else might have spent weeks training a model on half a dozen GPUs and on over a million images, so by using someone else's pretrained model and fine-tuning on your own dataset, you can often get going much faster on an application (a minimal sketch of this workflow appears at the end of this transcript). But of course, if you have the compute resources and the inclination, don't let me stop you from training your own networks from scratch. In fact, if you want to invent your own computer vision algorithm, that's what you might have to do.

So that's it for this week. I hope that seeing a number of computer vision architectures helps you get a sense of what works. In this week's programming exercises, you'll learn another programming framework and use it to implement ResNet. I hope you enjoy that programming exercise, and I look forward to seeing you next week.
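To make the fine-tuning workflow above concrete, here is a minimal sketch, assuming TensorFlow/Keras and an ImageNet-pretrained ResNet50 backbone; the backbone, the 3-class head (e.g. Tigger/Misty/neither), and the dataset names are illustrative choices, not prescribed by the lecture:

```python
import tensorflow as tf

# Load a backbone pretrained on ImageNet, dropping its classification head.
base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                      pooling="avg", input_shape=(224, 224, 3))
base.trainable = False   # freeze the pretrained weights at first

# Add a small new head for our own task, e.g. 3 classes.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)  # train_ds/val_ds: your data

# With more data, optionally unfreeze the backbone and fine-tune end to end
# with a small learning rate:
# base.trainable = True
# model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
#               loss="categorical_crossentropy", metrics=["accuracy"])
```

Freezing the backbone first lets the new head learn from few examples without destroying the pretrained features; how many layers you later unfreeze depends on how much data you have.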