You've seen how setting up a dev set and an evaluation metric is like placing a target for your team to aim at. But sometimes, partway through a project, you might realize you put the target in the wrong place. In that case, you should move the target. Let's look at an example.

Say you've built a cat classifier to find lots of pictures of cats to show to your cat-loving users, and the metric you've decided to use is classification error. Algorithms A and B have 3% error and 5% error respectively, so it seems like algorithm A is doing better. But when you try out these algorithms, you find that algorithm A, for some reason, is letting through a lot of pornographic images. If you ship algorithm A, users will see more cat images, since it has only 3% error at identifying cats, but it will also show them some pornographic images, which is totally unacceptable both for your company and for your users. In contrast, algorithm B has 5% error, so it misclassifies more images overall, but it doesn't let through pornographic images. From your company's point of view, as well as from a user-acceptance point of view, algorithm B is actually the much better algorithm.

So what's happened in this example is that algorithm A is doing better on the evaluation metric, getting 3% error, but it's actually the worse algorithm. The evaluation metric plus the dev set prefer algorithm A, because algorithm A has lower error on the metric you're using, but you and your users prefer algorithm B, because it isn't letting through pornographic images. When this happens, when your evaluation metric no longer correctly rank-orders your preferences between algorithms (in this case it wrongly ranks algorithm A as the better algorithm), that's a sign that you should change your evaluation metric, or perhaps your dev set or test set.

The misclassification error metric you've been using can be written as

    Error = (1/m_dev) Σ_{i=1..m_dev} 1{ŷ(i) ≠ y(i)}

where m_dev is the number of examples in the dev set, ŷ(i) is the predicted label (0 or 1) on dev example i, and the indicator function 1{...} counts up the number of examples on which the condition inside is true. So this formula just counts up the number of misclassified examples.

The problem with this evaluation metric is that it treats pornographic and non-pornographic images equally, but you really want your classifier to not mislabel pornographic images, for example by recognizing a pornographic image as a cat image and showing it to an unsuspecting user who is then very unhappy at unexpectedly seeing porn. One way to change the metric is to add a weight term w(i), where w(i) = 1 if x(i) is non-pornographic and w(i) = 10, or maybe an even larger number like 100, if x(i) is pornographic:

    Error = (1/m_dev) Σ_{i=1..m_dev} w(i) · 1{ŷ(i) ≠ y(i)}

This gives a much larger weight to pornographic examples, so the error term goes up much more if the algorithm misclassifies a pornographic image as a cat image.
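To make the weighted metric concrete, here is a minimal NumPy sketch, not code from the course; the function weighted_error and the arrays y_true, y_pred, and is_porn are hypothetical names. It divides by the total weight rather than by m_dev, which is the small normalization fix discussed next.

```python
import numpy as np

def weighted_error(y_true, y_pred, is_porn, porn_weight=10.0):
    """Weighted misclassification error on a dev set.

    y_true, y_pred -- 0/1 labels and predictions ("cat" vs. "not cat")
    is_porn        -- boolean mask marking the pornographic dev-set examples
    porn_weight    -- how much more a mistake on a pornographic image counts
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    # w(i) = 1 for ordinary images, porn_weight for pornographic ones
    w = np.where(np.asarray(is_porn), porn_weight, 1.0)
    mistakes = (y_pred != y_true).astype(float)   # indicator 1{y_hat(i) != y(i)}
    # Dividing by the total weight (rather than m_dev) keeps the error in [0, 1]
    return float(np.sum(w * mistakes) / np.sum(w))

# Both algorithms below make exactly one mistake, so plain error cannot
# separate them, but the weighted error heavily penalizes the one whose
# mistake is on the pornographic image.
y_true  = np.array([1, 1, 0, 0, 1])
is_porn = np.array([False, False, True, False, False])
pred_a  = np.array([1, 1, 1, 0, 1])   # misclassifies the pornographic image
pred_b  = np.array([1, 0, 0, 0, 1])   # misclassifies an ordinary image
print(weighted_error(y_true, pred_a, is_porn))   # ~0.71
print(weighted_error(y_true, pred_b, is_porn))   # ~0.07
```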
In this example, you're giving a ten-times-bigger weight to classifying pornographic images correctly. And if you want the normalization constant to still be right, the 1/m_dev out front technically becomes 1 over the sum of the w(i), so that the error still lies between 0 and 1:

    Error = (1/Σ_i w(i)) Σ_{i=1..m_dev} w(i) · 1{ŷ(i) ≠ y(i)}

The details of this weighting aren't important, and to actually implement it you would need to go through your dev and test sets and label the pornographic images, so that you can compute the weights. The high-level takeaway is this: if you find that your evaluation metric is not giving the correct rank-order preference for what is actually the better algorithm, that's the time to define a new evaluation metric. This is just one possible way to define one; the goal of the evaluation metric is to accurately tell you, given two classifiers, which one is better for your application. For the purposes of this video, don't worry too much about the details of how we defined the new error metric. The point is that if you're not satisfied with your old error metric, don't keep coasting with a metric you're unsatisfied with; instead, define a new one that you think better captures your preferences about what is actually the better algorithm.

One thing you might notice is that so far we've only talked about how to define a metric to evaluate classifiers, that is, a metric that helps us rank-order classifiers that perform at varying levels in terms of screening out porn. This is actually an example of orthogonalization, where I think you should take a machine learning problem and break it into distinct steps. One knob, or one step, is to figure out how to define a metric that captures what you want to do; how to actually do well on that metric is something I would worry about separately. To use the target analogy, the first step is to place the target, to define where you want to aim. That's one knob you can tune. How to do well at hitting that target, how to aim accurately and how to shoot at it, is a completely separate problem, a separate knob to tune. Defining the metric is step one, and you do something else for step two.

In terms of shooting at the target, maybe your learning algorithm is optimizing a cost function that looks like

    J = (1/m) Σ_{i=1..m} L(ŷ(i), y(i)),

that is, minimizing the average of the losses on your training set. One thing you could do is modify this to incorporate the weights as well, and perhaps change the normalization constant too, to 1 over the sum of the w(i):

    J = (1/Σ_i w(i)) Σ_{i=1..m} w(i) · L(ŷ(i), y(i))

Again, the details of how you define J aren't important. The point is that, with the philosophy of orthogonalization, placing the target is one step, and aiming and shooting at the target is a distinct step that you do separately. In other words, I encourage you to think of defining the metric as one step, and only after you've defined the metric to figure out how to do well on it, which might mean changing the cost function J that your neural network is optimizing.
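If you do decide to carry the same weights into training, the modified cost J can be computed in the same spirit. Here is a hedged sketch assuming a logistic (cross-entropy) loss; the function weighted_cost and its argument names are hypothetical, not anything defined in the course, and your network's actual loss L may differ.

```python
import numpy as np

def weighted_cost(y_hat, y, w, eps=1e-12):
    """Weighted cross-entropy cost J over the training set.

    y_hat -- predicted probabilities in (0, 1)
    y     -- true 0/1 labels
    w     -- per-example weights, e.g. 1 for ordinary images, 10 for pornographic ones
    """
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)   # avoid log(0)
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))       # L(y_hat(i), y(i))
    # Same normalization idea as the metric: divide by sum(w) instead of m
    return float(np.sum(w * losses) / np.sum(w))
```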
Before going on, let's look at just one more example. Let's say that your two cat classifiers, A and B, have 3% error and 5% error respectively, as evaluated on your dev set, or maybe even on your test set, which are images downloaded off the internet, so high-quality, well-framed images. But when you deploy your product, you find that algorithm B actually seems to be performing better, even though algorithm A is doing better on your dev set. You realize you've been training and evaluating on very nice, high-quality images downloaded off the internet, but when you deploy this in a mobile app, users upload all sorts of pictures: they're much less well framed, maybe the cats have funny facial expressions, and maybe the images are much blurrier. When you test your algorithms on that data, you find that algorithm B is actually doing better.

This is another example of your metric and dev/test set letting you down. The problem is that you're evaluating on dev and test sets of very nice, high-resolution, well-framed images, but what your users really care about is the algorithm doing well on the images they upload, which are less professionally shot, blurrier, and less well framed. So the guideline is: if doing well on your metric and your current dev set, or dev/test set distribution, does not correspond to doing well on the application you actually care about, then change your metric and/or your dev/test set. In other words, if you discover that your dev/test set contains very high-quality images, but evaluating on that dev/test set is not predictive of how well your app actually performs because your app needs to handle lower-quality images, then that's a good time to change your dev/test set so that your data better reflects the type of data you actually need to do well on. The overall guideline is: if your current metric and the data you're evaluating on don't correspond to doing well on what you actually care about, change your metric and/or your dev/test set to better capture what you need the algorithm to do well on.

Having an evaluation metric and a dev set allows you to decide much more quickly whether algorithm A or algorithm B is better, so it really speeds up how quickly you and your team can iterate. My recommendation is that even if you can't define the perfect evaluation metric and dev set, set something up quickly and use it to drive the speed of your team's iteration. If later down the line you find it wasn't a good one and you have a better idea, change it at that time; that's perfectly okay. What I recommend against, for most teams, is running for too long without any evaluation metric and dev set, because that can slow down the efficiency with which your team can iterate and improve your algorithm.

So that's it on when to change your evaluation metric and/or dev and test sets. I hope these guidelines help you set your whole team up with a well-defined target that you can iterate efficiently toward improving performance on.