Estimating the bias and variance of your learning algorithm really helps you prioritize what to work on next. But the way you analyze bias and variance changes when your training set comes from a different distribution than your dev and test sets. Let's see how.

Let's keep using our cat classification example, and let's say humans get near-perfect performance on this. So Bayes error, or Bayes optimal error, we know is nearly 0% on this problem. To carry out error analysis, you usually look at the training error and also the error on the dev set. So let's say, in this example, that your training error is 1% and your dev error is 10%. If your dev data came from the same distribution as your training set, you would say that you have a large variance problem: your algorithm is just not generalizing well from the training set, which it's doing well on, to the dev set, which it's suddenly doing much worse on. But in the setting where your training data and your dev data come from different distributions, you can no longer safely draw this conclusion. In particular, maybe it's doing just fine on the dev set; it's just that the training set was really easy, because it was high-resolution, very clear images, and the dev set is just much harder. So maybe there isn't a variance problem, and this just reflects that the dev set contains images that are much more difficult to classify accurately.

The problem with this analysis is that when you went from the training error to the dev error, two things changed at once. One, the algorithm saw data in the training set but not in the dev set. Two, the distribution of data in your dev set is different. Because you changed two things at the same time, it's difficult to know how much of this 9% increase in error is because the algorithm didn't see the data in the dev set (that's the variance part of the problem) and how much is because the dev set data is just different.

In order to tease out these two effects (and if you didn't totally follow what these two different effects are, don't worry, we'll go over them again in a second), it will be useful to define a new piece of data which we'll call the training-dev set. This is a new subset of data that you carve out, which has the same distribution as the training set, but which you don't explicitly train your neural network on. Here's what I mean. Previously, we had set up a training set, a dev set, and a test set, where the dev and test sets have the same distribution but the training set has a different distribution. What we're going to do is randomly shuffle the training set and then carve out a piece of it to be the training-dev set. So just as the dev and test sets have the same distribution, the training set and the training-dev set also have the same distribution. But the difference is that now you train your neural network just on the training set proper; you won't run backpropagation on the training-dev portion of the data. To carry out error analysis, what you should do is look at the error of your classifier on the training set, on the training-dev set, as well as on the dev set.
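As a rough illustration of carving out a training-dev set, here is a minimal sketch (mine, not from the lecture); the array names, the 10% split fraction, and the use of NumPy are all assumptions for the example.

```python
import numpy as np

def carve_out_training_dev(X_train, y_train, frac=0.1, seed=0):
    """Randomly shuffle the training set and hold out a slice as the
    training-dev set. You train only on the remaining portion; the
    held-out slice is used purely for error analysis, never for
    backpropagation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_train))
    n_hold = int(frac * len(X_train))
    train_dev_idx, train_idx = idx[:n_hold], idx[n_hold:]
    return (X_train[train_idx], y_train[train_idx],          # train on this
            X_train[train_dev_idx], y_train[train_dev_idx])  # evaluate only
```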
So let's say in this example that your training error is 1%, the error on the training-dev set is 9%, and the error on the dev set is 10%, same as before. What you can conclude from this is that when you went from training data to training-dev data, the error went up a lot. And the only difference between the training data and the training-dev data is that your neural network got to see the first part; it was trained explicitly on that, but it wasn't trained explicitly on the training-dev data. This tells you that you have a variance problem, because the training-dev error was measured on data that comes from the same distribution as your training set. So you know that even though your neural network does well on the training set, it's just not generalizing well to data from the same distribution that it hadn't seen before. In this example, you really have a variance problem.

Let's look at a different example. Say the training error is 1% and the training-dev error is 1.5%, but when you go to the dev set, your error is 10%. Now you have a pretty low variance problem, because when you went from training data that the neural network has seen to training-dev data that it has not seen, the error increases only a little bit. But then it really jumps when you go to the dev set. So this is a data mismatch problem, because your learning algorithm was not trained explicitly on either the training-dev or the dev data, but these two data sets come from different distributions. Whatever your algorithm is learning, it works great on training-dev but it doesn't work well on dev, so somehow your algorithm has learned to do well on a different distribution than the one you really care about. I'm going to call that a data mismatch problem.

Let's look at a few more examples. I'll write these on the next row since I'm running out of space on top: training error, training-dev error, and dev error. Say that training error is 10%, training-dev error is 11%, and dev error is 12%. Remember that the human-level proxy for Bayes error is roughly 0%. If you have this type of performance, then you really have an avoidable bias problem, because you're doing much worse than human level. So this is really a high-bias setting.

And one last example: if your training error is 10%, your training-dev error is 11%, and your dev error is 20%, then it looks like this actually has two issues. One, the avoidable bias is quite high, because you're not even doing that well on the training set: humans get nearly 0% error, but you're getting 10% error on the training set. The variance here seems quite small, but the data mismatch is quite large. So for this example, I would say you have a large avoidable bias problem as well as a data mismatch problem.
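To make the arithmetic behind these diagnoses concrete, here is a small sketch (mine, not from the lecture) that computes the three gaps for the example numbers above; the labels in the comments are just the conclusions drawn in the text.

```python
def error_gaps(human, train, train_dev, dev):
    """Return (avoidable bias, variance, data mismatch), all in percent."""
    return train - human, train_dev - train, dev - train_dev

# The four example settings above, with human-level (Bayes) error ~ 0%:
print(error_gaps(0, 1, 9, 10))    # (1, 8, 1)     -> mostly a variance problem
print(error_gaps(0, 1, 1.5, 10))  # (1, 0.5, 8.5) -> mostly data mismatch
print(error_gaps(0, 10, 11, 12))  # (10, 1, 1)    -> mostly avoidable bias
print(error_gaps(0, 10, 11, 20))  # (10, 1, 9)    -> avoidable bias + data mismatch
```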
So let's take what we've done on this slide and write out the general principles. The key quantities I would look at are human-level error, your training set error, your training-dev set error (same distribution as the training set, but you didn't train explicitly on it), and your dev set error. Depending on the differences between these errors, you can get a sense of how big the avoidable bias, the variance, and the data mismatch problems are. So let's say that human-level error is 4%, your training error is 7%, your training-dev error is 10%, and your dev error is 12%.

The gap between human-level error and training error gives you a sense of the avoidable bias, because you'd like your algorithm to do at least as well as, or approach, human-level performance on the training set. The gap between training error and training-dev error gives you a sense of the variance, that is, how well you generalize from the training set to the training-dev set. And the gap between training-dev error and dev error gives you a sense of how much of a data mismatch problem you have.

Technically, you could also add one more number: the test set error. You shouldn't be doing development on your test set, because you don't want to overfit to it. But if you also look at this, then the gap between dev error and test error tells you the degree of overfitting to the dev set. If there's a huge gap between your dev set performance and your test set performance, it means you've maybe over-tuned to the dev set, and maybe you need to find a bigger dev set. Remember that your dev set and your test set come from the same distribution, so the only way for there to be a huge gap here, for the algorithm to do much better on the dev set than on the test set, is if you somehow managed to overfit the dev set. If that's the case, what you might consider doing is going back and getting more dev set data.

Now, I've written these numbers as if, as you go down the list, they always keep going up. Here's one example where they don't. Maybe human-level performance is 4%, training error is 7%, and training-dev error is 10%, but when you go to the dev set, you find that you actually, surprisingly, do much better: maybe the dev error is 6%, and the test error is 6% as well. I've seen effects like this working on, for example, a speech recognition task where the training data turned out to be much harder than the dev set and test set. The training and training-dev errors are evaluated on your training set distribution, while the dev and test errors are evaluated on your dev/test set distribution. So sometimes, if your dev/test set distribution is much easier for whatever application you're working on, these numbers can actually go down. If you see funny things like this, there's an even more general formulation of this analysis that might be helpful. Let me quickly explain that on the next slide.
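Here is the full list of quantities just described in one place, including the optional dev-to-test comparison (again a sketch under my own assumptions; the test error value is made up for illustration, since the example above doesn't give one).

```python
errors = {                    # percent error; 4/7/10/12 are the example above
    "human_level": 4,
    "train": 7,
    "train_dev": 10,
    "dev": 12,
    "test": 12,               # assumed; a much higher value than dev would
}                             # suggest overfitting to the dev set

gaps = {
    "avoidable_bias":     errors["train"] - errors["human_level"],
    "variance":           errors["train_dev"] - errors["train"],
    "data_mismatch":      errors["dev"] - errors["train_dev"],  # negative if dev/test data is easier
    "overfitting_to_dev": errors["test"] - errors["dev"],
}
print(gaps)  # {'avoidable_bias': 3, 'variance': 3, 'data_mismatch': 2, 'overfitting_to_dev': 0}
```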
Let me motivate this using the speech-activated rearview mirror example. It turns out that the numbers we've been writing down can be placed into a table, where on the horizontal axis I place different data sets. For example, you might have data from your general speech recognition tasks: a bunch of data that you've collected from a lot of speech recognition problems you've worked on, from smart speakers, data you've purchased, and so on. And then you also have the rearview-mirror-specific speech data, recorded inside the car. So on the x-axis of the table, I vary the data set. On the other axis, I label different ways of examining the data. First, there's human-level performance: how accurate humans are on each of these data sets. Then there is the error on examples that your neural network has trained on. And finally, there's the error on examples that your neural network has not trained on.

What we were calling human-level error on the previous slide is the number that goes in the first box of the first column: how well humans do on this category of data, say data from all sorts of speech recognition tasks, the 500,000 utterances that you pour into your training set. In the example on the previous slide, this was 4%. The next number down is the training error, which in the example on the previous slide was 7%: if your learning algorithm has seen an example, performed gradient descent on it, and that example came from your training set distribution, the general speech recognition distribution, how well does your algorithm do on it? Below that is the training-dev set error, which is usually a bit higher: for data from this same general speech recognition distribution, if your algorithm did not train explicitly on an example, how well does it do? That's what we call the training-dev error. Then, moving over to the right, the corresponding box in the second column is the dev set error, or maybe also the test set error, which was 6% in the example just now. Dev and test error are technically two numbers, but either one could go in this box. This is data from the rearview mirror application, actually recorded in a car, on which your neural network did not perform backpropagation: what is the error?

What we were doing in the analysis on the previous slide was looking at the differences between pairs of these numbers. The gap between human level and training error is a measure of avoidable bias. The gap between training error and training-dev error is a measure of variance. And the gap between training-dev error and dev error is a measure of data mismatch.

It turns out that it can be useful to also fill in the remaining two entries in this table. Suppose the human-level entry for the rearview mirror data turns out to be 6% as well; the way you get this number is you ask some humans to label the rearview mirror speech data and measure how good humans are at this task. And maybe the trained-on entry for the rearview mirror data also turns out to be 6%; the way you do that is you take some rearview mirror speech data, put it in the training set so the neural network learns on it as well, and then measure the error on that subset of the data. If this is what you get, then it turns out you're actually already performing at the level of humans on this rearview mirror speech data, so maybe you're doing quite well on that distribution of data.

When you do this more detailed analysis, it doesn't always give you one clear path forward, but sometimes it just gives you additional insights. For example, comparing the two human-level numbers in this case tells us that the rearview mirror speech data is actually harder for humans than general speech recognition, because humans get 6% error rather than 4% error. And looking at the other differences as well may help you understand the bias, variance, and data mismatch problems to different degrees. This more general formulation is something I've used a few times, though not that often; for a lot of problems, you'll find that examining the subset of entries we discussed, looking at those first three differences, is enough to point you in a pretty promising direction. But sometimes filling out this whole table can give you additional insights.
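As a rough sketch of what filling in the whole table might look like in code (mine, not from the lecture; the nested-dictionary layout is an assumption, and the numbers are the example values discussed above):

```python
# Columns: the two data distributions; rows: how the error is measured.
table = {
    "general_speech":  {"human_level": 4, "trained_on": 7, "not_trained_on": 10},
    "rearview_mirror": {"human_level": 6, "trained_on": 6, "not_trained_on": 6},
}

gen, mirror = table["general_speech"], table["rearview_mirror"]
avoidable_bias = gen["trained_on"] - gen["human_level"]            # 3
variance       = gen["not_trained_on"] - gen["trained_on"]         # 3
data_mismatch  = mirror["not_trained_on"] - gen["not_trained_on"]  # -4: mirror data is easier
# Comparing human level across columns (6% vs. 4%) shows the rearview-mirror
# speech is harder for humans, yet the model already matches humans on it (6%).
```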
Finally, we've talked a lot about ideas for addressing bias and techniques for addressing variance, but how do you address data mismatch? What we've seen is that using training data that comes from a different distribution than your dev and test set can give you a lot more data and therefore help the performance of your learning algorithm. But instead of just having bias and variance as two potential problems, you now have this third potential problem, data mismatch. So what if you perform error analysis and conclude that data mismatch is a huge source of error? How can you go about addressing that? I'll be honest and say there unfortunately aren't great, or at least not very systematic, ways to address data mismatch, but there are a few things you can try that could help. Let's take a look at them in the next video.