When building an anomaly detection algorithm, I found that choosing a good set of features turns out to be really important. In supervised learning, if you don't have the features quite right, or if you have a few extra features that are not relevant to the problem, that often turns out to be okay, because the algorithm has the supervised signal, that is, enough labels y, to figure out which features to ignore, how to rescale a feature, and how to take the best advantage of the features you do give it. But anomaly detection learns just from unlabeled data, so it's harder for the algorithm to figure out which features to ignore. I've found that carefully choosing the features is even more important for anomaly detection than for supervised learning approaches. Let's take a look in this video at some practical tips for how to tune the features for anomaly detection to try to get you the best possible performance.

One step that can help your anomaly detection algorithm is to try to make sure the features you give it are more or less Gaussian. And if a feature is not Gaussian, sometimes you can transform it to make it a little bit more Gaussian. Let me show you what I mean. If you have a feature x, I will often plot a histogram of the feature, which you can do using the Python command plt.hist; you'll see this in the practice lab as well. If the distribution looks pretty Gaussian, this would be a good candidate feature, assuming you think it helps distinguish between anomalies and normal examples. But quite often, when you plot a histogram of your features, you may find that a feature has a skewed distribution that does not at all look like a symmetric bell-shaped curve. When that is the case, I would consider whether you can transform that feature to make it more Gaussian. For example, maybe if you were to compute the log of x1 and plot a histogram of log of x1, it would look much more Gaussian. In that case, instead of using the original feature x1, you might replace it with log of x1, because when x1 is made more Gaussian, the Gaussian distribution that the anomaly detection algorithm fits to p of x1 is more likely to be a good fit to the data. Other than the log function, for a different feature x2 you might replace it with log of x2 plus 1, or more generally log of x2 plus c, where c is a constant you can adjust to try to make the feature more Gaussian. It turns out that a larger value of c ends up transforming the distribution less; in practice, I just try a bunch of different values of c and pick one that looks better in terms of making the distribution more Gaussian. Or for a different feature x3, you might try taking the square root, that is, x3 to the power of one-half, and you can also vary that exponent. For yet another feature x4, you might use x4 to the power of one-third, for example. So when I'm building an anomaly detection system, I'll sometimes take a look at my features by plotting histograms, and if I see any that are highly non-Gaussian, I might choose transformations like these, or others, to try to make them more Gaussian.
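As a rough sketch of what trying these transformations can look like in code (the placeholder data, the exponents, and the constant added inside the log below are illustrative choices of mine, not the ones from the lab), you can plot histograms of a few candidate transformations side by side and pick whichever looks most Gaussian:

```python
import numpy as np
import matplotlib.pyplot as plt

# x stands in for one feature's values; this skewed placeholder is just for illustration.
x = np.random.exponential(scale=2.0, size=1000)

# Candidate transformations to eyeball; the constant inside the log and the
# exponents are knobs you tune by looking at the histograms.
candidates = {
    "original x": x,
    "log(x + 1)": np.log(x + 1),
    "x ** 0.5": x ** 0.5,
    "x ** 0.25": x ** 0.25,
}

fig, axes = plt.subplots(1, len(candidates), figsize=(16, 3))
for ax, (name, values) in zip(axes, candidates.items()):
    ax.hist(values, bins=50)   # more bins give a finer view of the shape
    ax.set_title(name)
plt.show()
```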
Now let me illustrate how I actually do this in a Jupyter Notebook, so you can see what the process of exploring different transformations of the features might look like. When you have a feature x, you can plot a histogram of it as follows. It actually looks like this is a pretty coarse histogram, so let me increase the number of bins in my histogram to 50, that is, bins equals 50. Now there are more histogram bins. Oh, and by the way, if you want to change the color, you can also do so. If you want to try a different transformation, you can try, for example, plotting the square root of x, that is, x to the power of 0.5, again with 50 histogram bins. This actually looks somewhat more Gaussian, but not perfectly, so let's try a different parameter, say x to the power of 0.25. Maybe that adjusts a little too far, so let me try 0.4. That looks pretty Gaussian. So one thing you could do is replace x with x to the power of 0.4; that is, you would set x equal to x to the power of 0.4 and use that value of x in your training process instead. Let me show you another transformation. Here I'm going to try taking the log of x, plotted with 50 bins, using the NumPy log function. It turns out you get an error, because x in this example has some values that are equal to 0, and log of 0 is not defined (it tends to negative infinity). So a common trick is to add a very tiny number, so that x plus 0.001, say, is strictly positive. Then you get a histogram, and if you want the distribution to look more Gaussian, you can also play around with that added constant to see if there's a value that causes the data to look more symmetric and more Gaussian. As you can see, you can very quickly change these parameters and plot the histogram to try to get something a bit more Gaussian than the original data x in the histogram up above. If you read the machine learning literature, there are ways to automatically measure how close these distributions are to a Gaussian, but I've found that in practice it doesn't make a big difference: if you just try a few values and pick something that looks right to you, that will work well for practical purposes. So by trying things out in a Jupyter Notebook, you can pick a transformation that makes your data more Gaussian. And just as a reminder, whatever transformation you apply to the training set, please remember to apply the same transformation to your cross-validation and test set data as well.

Other than making sure that your data is approximately Gaussian, after you've trained your anomaly detection algorithm, if it doesn't work that well on your cross-validation set, you can also carry out an error analysis process for anomaly detection. In other words, you can look at where the algorithm is not yet doing well, where it's making errors, and use that to try to come up with improvements. As a reminder, what we want is for p of x to be large, greater than or equal to epsilon, for normal examples x, and for p of x to be small, less than epsilon, for anomalous examples x. When you've learned the model p of x from your unlabeled data, the most common problem you may run into is that p of x is comparable in value, say large, for both normal and anomalous examples.
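To make that evaluation and error-analysis loop concrete, here is a minimal sketch assuming a per-feature Gaussian model and a labeled cross-validation set; the helper names, the placeholder data, the chosen transformation, and the value of epsilon are all mine, not from the lab:

```python
import numpy as np

def estimate_gaussian(X):
    """Per-feature Gaussian fit: the mean and variance of each column of X."""
    return X.mean(axis=0), X.var(axis=0)

def p_of_x(X, mu, var):
    """p(x) as the product of univariate Gaussian densities across features."""
    densities = np.exp(-((X - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    return densities.prod(axis=1)

# Placeholder data standing in for your real sets: an unlabeled training set and a
# labeled cross-validation set in which y_cv == 1 marks known anomalies.
rng = np.random.default_rng(0)
X_train = rng.gamma(shape=9.0, scale=0.5, size=(500, 2))
X_cv = rng.gamma(shape=9.0, scale=0.5, size=(100, 2))
y_cv = np.zeros(100, dtype=int)
X_cv[:5] *= 3.0               # inject a few anomalies
y_cv[:5] = 1

# Whatever transformation you settled on, apply it identically to every split.
X_train_t, X_cv_t = X_train ** 0.4, X_cv ** 0.4

mu, var = estimate_gaussian(X_train_t)
p_cv = p_of_x(X_cv_t, mu, var)
epsilon = 1e-3                # a threshold you would normally tune on the CV set

# Error analysis: known anomalies the model fails to flag (p(x) still >= epsilon);
# these are the examples to inspect when hunting for new features.
missed = X_cv[(y_cv == 1) & (p_cv >= epsilon)]
print("missed anomalies to inspect:\n", missed)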
As a concrete example, if this is your data set, you might fit that Gaussian to it, and if you have an example in your cross-validation set or test set that is over here, that is anomalous, then it has a pretty high probability; in fact, it looks quite similar to the other examples in your training set. So even though this is an anomaly, p of x is actually pretty large, and the algorithm will fail to flag this particular example as an anomaly. In that case, what I would normally do is look at that example and try to figure out what it is that made me think it's an anomaly, even though this feature, x1, took on values similar to the other training examples. If I can identify some new feature, say x2, that helps distinguish this example from the normal examples, then adding that feature can help improve the performance of the algorithm. Here's a picture showing what I mean. Say I'm trying to detect fraudulent behavior, and x1 is the number of transactions a user makes; maybe this user looks like they're making a similar number of transactions as everyone else. But suppose I discover that this user has some insanely fast typing speed, and I add a new feature, x2, that is the typing speed of this user. If it turns out that when I plot the data using the old feature x1 and this new feature x2, this example stands out over here, then it becomes much easier for the anomaly detection algorithm to recognize that this is an anomalous user. Because with this new feature x2, the learning algorithm may fit a Gaussian distribution that assigns high probability to points in this region, a bit lower in this region, and a bit lower in this region. And so this example, because of the very anomalous value of x2, becomes easier to detect as an anomaly.

So just to summarize, the development process I'll often go through is to train a model, see which anomalies in the cross-validation set the algorithm is failing to detect, and then look at those examples to see if they can inspire the creation of new features on which those examples take on unusually large or unusually small values, so that the algorithm can now successfully flag them as anomalies.
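Here is a small sketch of that fraud example in code; the feature names, the numbers, and the typing-speed values are made up for illustration, and I'm using SciPy's normal density just for convenience:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Hypothetical fraud example: x1 = number of transactions, x2 = typing speed.
x1 = rng.normal(30, 5, size=500)     # transactions per week for normal users
x2 = rng.normal(200, 30, size=500)   # typing speed in characters per minute

# A suspicious user whose transaction count looks normal but who types absurdly fast.
suspect = np.array([31.0, 900.0])

def p_of_x(point, features):
    """Product of per-feature Gaussian densities, each fit to the training data."""
    p = 1.0
    for value, feature in zip(point, features):
        p *= norm.pdf(value, loc=feature.mean(), scale=feature.std())
    return p

print("p(x) using x1 only:   ", p_of_x(suspect[:1], [x1]))
print("p(x) using x1 and x2: ", p_of_x(suspect, [x1, x2]))
```

With x1 alone, the suspect's density is in the same range as everyone else's, but once x2 is included, p of x collapses toward zero, which is exactly what lets a threshold epsilon catch it.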
Just as one more example, let's say you're building an anomaly detection system to monitor computers in a data center, to try to figure out whether a computer may be behaving strangely and deserves a closer look, maybe because of a hardware failure or because it's been hacked into. What you'd like to do is choose features that might take on unusually large or small values in the event of an anomaly. You might start off with features like x1, the memory use; x2, the number of disk accesses per second; x3, the CPU load; and x4, the volume of network traffic. If you train the algorithm, you may find that it detects some anomalies but fails to detect others. In that case, it's not unusual to create new features by combining old features. For example, you might find a computer that is behaving very strangely, yet neither its CPU load nor its network traffic is unusual on its own. If you're running a data center that streams videos, then computers may have high CPU load and high network traffic, or low CPU load and low network traffic. But what's unusual about this one machine is that it has a very high CPU load despite a very low network traffic volume. In that case, you might create a new feature, x5, the ratio of CPU load to network traffic, and this new feature would help the anomaly detection algorithm flag future examples like this specific machine as anomalous. You can also consider other features, like the square of the CPU load divided by the network traffic volume, and you can play around with different choices of these features to try to get it so that p of x is still large for the normal examples but becomes small for the anomalies in your cross-validation set.

So that's it. Thanks for sticking with me to the end of this week. I hope you enjoyed hearing about both clustering algorithms and anomaly detection algorithms, and that you also enjoy playing with these ideas in the practice labs. Next week, we'll go on to talk about recommender systems. When you go to a website and it recommends products or movies or other things to you, how does that algorithm actually work? This is one of the most commercially important algorithms in machine learning, yet one that gets talked about surprisingly little. Next week, we'll take a look at how these algorithms work, so that the next time you go to a website and it recommends something to you, you'll understand how that came about, and you'll also be able to build algorithms like that yourself. So have fun with the labs, and I look forward to seeing you next week.