One of the most powerful ideas in deep learning is that sometimes you can take knowledge the neural network has learned from one task and apply that knowledge to a separate task. So, for example, maybe you can have a neural network learn to recognize objects like cats, and then use that knowledge, or use part of that knowledge, to help you do a better job reading x-ray scans. This is called transfer learning. Let's take a look.

Let's say you've trained a neural network on image recognition. So you first take a neural network and train it on (x, y) pairs, where x is an image and y is some object in the image; it's a cat or a dog or a bird or something else. If you want to take this neural network and adapt, or as we say, transfer, what it has learned to a different task, such as radiology diagnosis, meaning really reading x-ray scans, what you can do is take this last output layer of the neural network and just delete that, and delete also the weights feeding into that last output layer, and create a new set of randomly initialized weights just for the last layer, and have that now output the radiology diagnosis.

So, to be concrete, during the first phase of training, when you're training on an image recognition task, you train all of the usual parameters of the neural network, all the weights, all the layers, and you have something that now learns to make image recognition predictions. Having trained that neural network, what you now do to implement transfer learning is swap in a new data set (x, y), where x is now a radiology image and y is the diagnosis you want to predict. And what you do is initialize the last layer's weights, let's call them W^[L] and b^[L], randomly, and now retrain the neural network on this new data set, on the new radiology data set.

You have a couple of options for how you retrain the neural network with radiology data. If you have a small radiology data set, you might want to just retrain the weights of the last layer, just W^[L] and b^[L], and keep the rest of the parameters fixed. If you have enough data, you could also retrain all the layers of the rest of the neural network. And the rule of thumb is: if you have a small data set, then just retrain that one last layer, the output layer, or maybe the last one or two layers; but if you have a lot of data, then maybe you can retrain all the parameters in the network.

And if you retrain all the parameters of the neural network, then this initial phase of training on image recognition is sometimes called pre-training, because you're using image recognition's data to pre-initialize, or really pre-train, the weights of the neural network. And then if you're updating all the weights afterward, then training on the radiology data is sometimes called fine-tuning. So if you hear the words pre-training and fine-tuning in a deep learning context, this is what they mean when they refer to pre-training and fine-tuning weights in a transfer learning task.

And what you've done in this example is you've taken knowledge learned from image recognition and applied it, or transferred it, to radiology diagnosis. And the reason this can be helpful is that a lot of the low-level features, such as detecting edges, detecting curves, detecting parts of objects, learned from a very large image recognition data set, might help your learning algorithm do better in radiology diagnosis. It's just learned a lot about the structure and the nature of what images look like, and some of that knowledge will be useful.
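To make the recipe concrete, here is a minimal sketch in PyTorch. This is an illustration of mine, not the lecture's exact model: the framework choice, layer sizes, class counts, and data below are all made-up placeholders. It pre-trains on image recognition labels (assumed already done), deletes and randomly re-initializes the last layer for the radiology labels, and then either retrains only that new layer or fine-tunes everything, depending on how much radiology data you have.

```python
import torch
import torch.nn as nn

# Hypothetical pre-trained image recognition network (a stand-in, not the lecture's model).
# Pretend it has already been trained on (image, object-label) pairs.
pretrained = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 64 * 3, 256), nn.ReLU(),   # earlier layers: low-level features worth reusing
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 1000),                     # last output layer: 1000 object classes
)

# Transfer learning: delete the old output layer (and the weights W^[L], b^[L] feeding into it)
# and swap in a freshly, randomly initialized layer for the radiology labels.
num_diagnoses = 5                              # made-up number of radiology classes
pretrained[-1] = nn.Linear(128, num_diagnoses)

small_radiology_dataset = True
if small_radiology_dataset:
    # Small radiology data set: freeze the pre-trained layers, retrain only the new last layer.
    for p in pretrained.parameters():
        p.requires_grad = False
    for p in pretrained[-1].parameters():
        p.requires_grad = True
    trainable_params = pretrained[-1].parameters()
else:
    # Enough data: fine-tune all parameters, starting from the pre-trained values.
    trainable_params = pretrained.parameters()

optimizer = torch.optim.Adam(trainable_params, lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on fake radiology data (8 "x-ray images" of size 3x64x64).
x = torch.randn(8, 3, 64, 64)
y = torch.randint(0, num_diagnoses, (8,))
optimizer.zero_grad()
loss = loss_fn(pretrained(x), y)
loss.backward()
optimizer.step()
```

The only thing that changes between the small-data and large-data settings is which parameters the optimizer is allowed to update; the pre-trained weights in the earlier layers are reused either way.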
So having learned to recognize images, it might have learned enough about what parts of different images look like, that knowledge about lines, dots, curves, and so on, maybe small parts of objects, and that knowledge could help your radiology diagnosis network learn a bit faster or learn with less data.

Here's another example. Let's say that you've trained a speech recognition system. So now x is an input of audio, or audio snippets, and y is some transcript. So you've trained a speech recognition system to output transcripts. And let's say that you now want to build a wake word or trigger word detection system. Recall that a wake word or trigger word is the word we say in order to wake up speech-controlled devices in our houses, such as saying "Alexa" to wake up an Amazon Echo, or "OK Google" to wake up a Google device, or "Hey Siri" to wake up an Apple device, or saying "Ni hao Baidu" to wake up a Baidu device. In order to do this, you might take out the last layer of the neural network again and create a new output node. But sometimes, another thing you could do is create not just a single new output node, but add several new layers to your neural network to try to predict the label y for your wake word detection problem. And again, depending on how much data you have, you might just retrain the new layers of the network, or maybe you could retrain even more layers of this neural network.

So when does transfer learning make sense? Transfer learning makes sense when you have a lot of data for the problem you're transferring from, and usually relatively less data for the problem you're transferring to. For example, let's say you have a million examples for the image recognition task. That's a lot of data to learn a lot of low-level features, or to learn a lot of useful features, in the earlier layers of the neural network. But for the radiology task, maybe you have only 100 examples, so you have very little data for the radiology diagnosis problem, maybe only 100 x-ray scans. So a lot of the knowledge you learn from image recognition can be transferred, and can really help you get going with radiology recognition, even if you don't have a lot of data for radiology.

For speech recognition, maybe you've trained a speech recognition system on 10,000 hours of data. So you've learned a lot about what human voices sound like from that 10,000 hours of data, which really is a lot. But for your trigger word detection, maybe you have only one hour of data. That's not a lot of data to fit a lot of parameters. In this case, a lot of what you learn about what human voices sound like, what the components of human speech are, and so on, can be really helpful for building a good wake word detector, even though you have a relatively small data set, or at least a much smaller data set, for the wake word detection task. So in both of these cases, you're transferring from a problem with a lot of data to a problem with relatively little data.
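Here is a sketch of the wake word variant described above. Again, this is just an illustration: the encoder architecture, feature sizes, and data are invented, and PyTorch is my framework choice, not the lecture's. It keeps the pre-trained speech layers, stacks several new layers on top of them for the wake word label y, and retrains only those new layers.

```python
import torch
import torch.nn as nn

# Hypothetical pre-trained speech recognition encoder (a stand-in with invented sizes).
# Pretend its layers already learned useful representations of human speech from lots of audio.
speech_encoder = nn.Sequential(
    nn.Conv1d(in_channels=40, out_channels=64, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),      # -> (batch, 128) summary of each audio clip
)

# Instead of a single new output node, stack several new layers for wake word detection.
wake_word_head = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),                           # one output: wake word present or not
)
model = nn.Sequential(speech_encoder, wake_word_head)

# With only a little wake word data, retrain just the new layers;
# with more data, you could also unfreeze some or all of the encoder layers.
for p in speech_encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(wake_word_head.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# One illustrative step on fake audio features: 8 clips, 40 mel bins, 100 frames each.
x = torch.randn(8, 40, 100)
y = torch.randint(0, 2, (8, 1)).float()         # wake word present (1) or not (0)
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```

Freezing the encoder here mirrors the small-data case from the lecture: with only about an hour of wake word data, most of what the model knows about human speech comes from the pre-trained layers.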
One case where transfer learning would not make sense is if the opposite were true. So if you had 100 images for image recognition and you had 100 images for radiology diagnosis, or even 1,000 images for radiology diagnosis, one way to think about it is that to do well on radiology diagnosis, assuming what you really want to do well on is radiology diagnosis, having radiology images is much more valuable than having cat and dog and so on images. Each radiology example is much more valuable than each image recognition example, at least for the purpose of building a good radiology system. So if you already have more data for radiology, it's not that likely that having 100 images of random objects, of cats and dogs and cars and so on, would be that helpful, because the value of one example from your image recognition task of cats and dogs is just less than the value of one x-ray image for the task of building a good radiology system. So this would be one example where transfer learning might not hurt, but I wouldn't expect it to give you any meaningful gain either. And similarly, if you built a speech recognition system on just 10 hours of data and you actually have 10 hours, or maybe even more, say 50 hours, of data for wake word detection, it may or may not hurt, maybe it won't hurt, to include that 10 hours of data to do transfer learning, but you just wouldn't expect to get a meaningful gain.

So to summarize, when does transfer learning make sense? If you're trying to learn from some task A and transfer some of the knowledge to some task B, then transfer learning makes sense when tasks A and B have the same input x. In the first example, A and B both had images as input; in the second example, both had audio clips as input. It tends to make sense when you have a lot more data for task A than for task B. All this is under the assumption that what you really want to do well on is task B. And because each example of data for task B is more valuable for task B, you usually need a lot more data for task A, because each example from task A is just less valuable for task B than each example from task B. And then finally, transfer learning will tend to make more sense if you suspect that low-level features from task A could be helpful for learning task B. In both of the earlier examples, maybe learning image recognition teaches you enough about images to help with radiology diagnosis, and maybe learning speech recognition teaches you enough about human speech to help you with trigger word or wake word detection.

So to summarize, transfer learning tends to be most useful if you're trying to do well on some task B, usually a problem where you have relatively little data. For example, in radiology, it's difficult to get that many x-ray scans to build a good radiology diagnosis system. In that case, you might find a related but different task, such as image recognition, where you can get maybe a million images and learn a lot of low-level features from that, so that you can then try to do well on task B, on your radiology task, despite not having that much data for it. When transfer learning makes sense, it does help the performance of your learning algorithm significantly, but I've also sometimes seen transfer learning applied in settings where task A actually has less data than task B, and in those cases you kind of don't expect to see much of a gain.

So that's it for transfer learning, where you learn from one task and try to transfer to a different task. There's another version of learning from multiple tasks, which is called multitask learning, where you try to learn from multiple tasks at the same time, rather than learning from one and then sequentially, or after that, trying to transfer to a different task. So in the next video, let's discuss multitask learning.