Now that you've seen how the Gaussian or normal distribution works for a single number, we're ready to build our anomaly detection algorithm. Let's dive in.

You have a training set, x1 through xm, where each example x has n features, so each example x is a vector with n numbers. In the case of the airplane engine example, we had two features corresponding to the heat and the vibrations, so each of these xi's would be a two-dimensional vector and n would be equal to 2. But for many practical applications, n can be much larger, and you might do this with dozens or even hundreds of features.

Given this training set, what we would like to do is carry out density estimation, and all that means is we will build a model to estimate the probability p of x: what's the probability of any given feature vector? Our model for p of x is going to be as follows. x is a feature vector with values x1, x2, and so on, down to xn. I'm going to model p of x as the probability of x1 times the probability of x2 times the probability of x3, and so on, up through the probability of xn for the n features in the feature vector.

If you've taken an advanced class in probability and statistics before, you may recognize that this equation corresponds to assuming that the features x1, x2, and so on up to xn are statistically independent, but it turns out this algorithm often works fine even if the features are not actually statistically independent. If you didn't follow that, don't worry about it; understanding statistical independence is not needed to complete this class or to use the anomaly detection algorithm very effectively.

Now, to fill in this equation a little bit more: we are saying that the probability of the whole feature vector x is the product of p of x1 and p of x2 and so on up through p of xn. In order to model the probability of x1, say the heat feature in this example, we're going to have two parameters, mu1 and sigma1 squared. What that means is we're going to estimate the mean of the feature x1 and also its variance; those will be mu1 and sigma1 squared. To model p of x2, which is a totally different feature measuring the vibrations of the airplane engine, we're going to have two different parameters, mu2 and sigma2 squared, which correspond to the mean and the variance of the vibration feature, and so on. If you have additional features, you continue with mu3 and sigma3 squared, up through mu n and sigma n squared.

In case you're wondering why we multiply probabilities, here's one example that could build intuition. Suppose that for an aircraft engine there's a 1 in 10 chance that it runs really, really hot, unusually hot, and maybe a 1 in 20 chance that it vibrates really, really hard. Then what is the chance that it both runs really, really hot and vibrates really, really hard? We're saying that the chance of that is one-tenth times one-twentieth, which is 1 in 200, so it's really unlikely to get an engine that both runs really hot and vibrates really hard. The chance of both of these things happening, we're saying, is the product of the two probabilities.

A somewhat more compact way to write this equation is to say that p of x is equal to the product from j equals 1 through n of p of xj, with parameters mu j and sigma squared j.
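If it helps to see this model as code, here is a minimal NumPy sketch of the density model just described (the names feature_prob and p_of_x are illustrative, not from the lecture):

```python
import numpy as np

def feature_prob(xj, mu_j, var_j):
    """Univariate Gaussian density p(xj; mu_j, sigma^2_j) for one feature."""
    return np.exp(-(xj - mu_j) ** 2 / (2 * var_j)) / np.sqrt(2 * np.pi * var_j)

def p_of_x(x, mu, var):
    """Model p(x) as the product of the per-feature Gaussian densities.

    x   : (n,) feature vector x1..xn
    mu  : (n,) per-feature means mu_1..mu_n
    var : (n,) per-feature variances sigma^2_1..sigma^2_n
    """
    return np.prod(feature_prob(x, mu, var))
```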
This product symbol is a lot like the summation symbol, except that whereas the summation symbol corresponds to adding terms, this symbol corresponds to multiplying the terms for j equals 1 through n.

So let's put it all together and see how you can build an anomaly detection system. The first step is to choose features xj that you think might be indicative of anomalous examples. Having come up with the features you want to use, you would then fit the parameters mu1 through mu n and sigma squared 1 through sigma squared n for the n features in your dataset. As you might guess, the parameter mu j will be just the average of the feature xj over all the examples in your training set, and sigma squared j will be the average of the squared difference between the j-th feature and the value mu j you just computed. By the way, if you have a vectorized implementation, you can also compute mu as the average of the training examples, where x and mu are both vectors; this is the vectorized way of computing mu1 through mu n all at the same time. By estimating these parameters on your unlabeled training set, you've now computed all the parameters of your model.

Finally, when you are given a new example x test, or I'm just going to write the new example as x here, what you would do is compute p of x and see if it's large or small. p of x, as you saw on the last slide, is the product from j equals 1 through n of the probabilities of the individual features, p of xj with parameters mu j and sigma squared j. If you substitute in the formula for this probability, you end up with this expression: 1 over the square root of 2 pi times sigma j, times e to the power of minus the quantity xj minus mu j squared, over 2 sigma squared j. Here xj is the j-th feature of your new example, and mu j and sigma j are the parameters you computed in the previous step. If you compute out this formula, you get some number for p of x, and the final step is to check whether p of x is less than epsilon; if it is, you flag it as an anomaly.

One intuition behind what this algorithm is doing is that it will tend to flag an example as anomalous if one or more of its features are either very large or very small relative to what it has seen in the training set. For each of the features xj, you're fitting a Gaussian distribution, so if even one of the features of the new example falls way out in the tail, then that p of xj will be very small; and if just one of the terms in this product is very small, the overall product will tend to be very small, and thus p of x will be small. So what this anomaly detection algorithm gives you is a systematic way of quantifying whether a new example x has any features that are unusually large or unusually small.
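As a rough sketch of those two steps in code (again with illustrative names; fit_parameters uses the vectorized mean mentioned above, and the variance is the average squared deviation from it):

```python
import numpy as np

def fit_parameters(X):
    """Fit mu_j and sigma^2_j for each of the n features (vectorized).

    X : (m, n) matrix of training examples, one example per row.
    """
    mu = X.mean(axis=0)                 # mu_j: average of feature j over all examples
    var = ((X - mu) ** 2).mean(axis=0)  # sigma^2_j: average squared difference from mu_j
    return mu, var

def is_anomaly(x, mu, var, epsilon):
    """Compute p(x) as the product of per-feature Gaussians and compare to epsilon."""
    p = np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))
    return p < epsilon
```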
Now, let's take a look at what all this actually means on one example. Here's a dataset with features x1 and x2, and you'll notice that the feature x1 takes on a much larger range of values than the feature x2. If you compute the mean of the feature x1, you end up with 5, which is why mu1 is equal to 5, and it turns out that for this dataset, sigma1 will be equal to about 2. If you compute mu2, the average of the feature x2, the average is 3, and its standard deviation is much smaller, which is why sigma2 is equal to 1. So that corresponds to this Gaussian distribution for x1 and this Gaussian distribution for x2.

If you were to actually multiply p of x1 and p of x2, you end up with this 3D surface plot for p of x, where at any point the height of the surface is the product of p of x1 times p of x2 for the corresponding values of x1 and x2. This shows that values where p of x is higher, near the middle of the plot, are more likely, whereas values far out toward the edges have a much lower chance.

Now, let me pick two test examples: the first one, which I'm going to write as xtest1, and the second one down here as xtest2. Let's see which of these two examples the algorithm will flag as anomalous. I'm going to pick the parameter epsilon to be equal to 0.02. If you compute p of xtest1, it turns out to be about 0.04; this is much bigger than epsilon, so the algorithm will say this looks okay, it doesn't look like an anomaly. In contrast, if you compute p of x for the point down here, corresponding to x1 equals about 8 and x2 equals about 0.5, then p of xtest2 is 0.0021. This is much smaller than epsilon, so the algorithm will flag it as a likely anomaly. Pretty much as you might hope, it decides that xtest1 looks pretty normal, whereas xtest2, which is much further away from anything seen in the training set, looks like it could be an anomaly.

So, you've seen the process of how to build an anomaly detection system, but how do you choose the parameter epsilon, and how do you know if your anomaly detection system is working well? In the next video, let's dive a little more deeply into the process of developing and evaluating the performance of an anomaly detection system. Let's go on to the next video.
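To make the numbers from this example concrete, here is a quick check in code using the lecture's fitted parameters (mu1 = 5, sigma1 = 2, mu2 = 3, sigma2 = 1) and epsilon = 0.02. The test coordinates are only read roughly off the plot, so the density won't match the quoted 0.0021 exactly, but it lands well below epsilon either way:

```python
import numpy as np

# Parameters from the lecture's example: mu1 = 5, sigma1 = 2 (var1 = 4),
# mu2 = 3, sigma2 = 1 (var2 = 1), and threshold epsilon = 0.02.
mu = np.array([5.0, 3.0])
var = np.array([4.0, 1.0])
epsilon = 0.02

def p_of_x(x):
    """Product of the per-feature Gaussian densities."""
    return np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))

# xtest2 is read off the plot as roughly (8, 0.5); the lecture quotes
# p(xtest2) = 0.0021, and this approximation comes out around 1e-3.
x_test2 = np.array([8.0, 0.5])
p2 = p_of_x(x_test2)
print(p2, p2 < epsilon)  # p2 << 0.02, so xtest2 is flagged as an anomaly
```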