The softmax regression algorithm is a generalization of logistic regression, which is a binary classification algorithm, to the multiclass classification context. Let's take a look at how it works.

Recall that logistic regression applies when Y can take on two possible output values, either 0 or 1. The way it computes its output is, you would first calculate Z equals W dot product with X plus B, and then you would compute what I'm going to call here A equals G of Z, which is the sigmoid function applied to Z. We interpreted this as logistic regression's estimate of the probability of Y being equal to 1, given those input features X. Now, quick quiz question. If the probability of Y equals 1 is 0.71, then what is the probability that Y is equal to 0? Well, the chance of Y being equal to 1 and the chance of Y being equal to 0 have to add up to 1, right? So if there's a 71% chance of it being 1, there has to be a 29%, or 0.29, chance of it being equal to 0.

So to embellish logistic regression a little bit, in order to set us up for the generalization to softmax regression, I'm going to think of logistic regression as actually computing two numbers. First, A1, which is the quantity that we had previously, the chance of Y being equal to 1 given X. And second, I'm going to think of logistic regression as also computing A2, which is 1 minus this, which is just the chance of Y being equal to 0, given the input features X. And so A1 and A2, of course, have to add up to 1.

Let's now generalize this to softmax regression. I'm going to do this with a concrete example of when Y can take on four possible outputs, so Y can take on the values 1, 2, 3, or 4. Here's what softmax regression will do. It will compute Z1 as W1 dot product with X plus B1, then Z2 equals W2 dot product with X plus B2, and so on for Z3 and Z4. Here, W1, W2, W3, and W4, as well as B1, B2, B3, and B4, are the parameters of softmax regression.

Next, here's the formula for softmax regression. We'll compute A1 equals e to the Z1 divided by e to the Z1 plus e to the Z2 plus e to the Z3 plus e to the Z4, and A1 will be interpreted as the algorithm's estimate of the chance of Y being equal to 1, given the input features X. Then softmax regression will compute A2 equals e to the Z2 divided by the same denominator, e to the Z1 plus e to the Z2 plus e to the Z3 plus e to the Z4, and we'll interpret A2 as the algorithm's estimate of the chance that Y is equal to 2, given the input features X. Similarly for A3, where the numerator is now e to the Z3 divided by the same denominator. That's the estimated chance of Y being equal to 3. And similarly, A4 takes on the corresponding expression.

Whereas on the left we wrote down the specification for the logistic regression model, these equations on the right are our specification for the softmax regression model. It has parameters W1 through W4 and B1 through B4. And if you can learn appropriate choices for all these parameters, then this gives you a way of predicting what's the chance of Y being 1, 2, 3, or 4, given a set of input features X.

Quick quiz. Let's say you run softmax regression on a new input X, and you find that A1 is 0.30, A2 is 0.20, and A3 is 0.15. What do you think A4 will be? Why don't you take a look at this quiz and see if you can figure out the right answer.
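As an aside, here is a minimal NumPy sketch of the four-class computation just described. The weight, bias, and input values are made up purely for illustration, but running something like this is one way to convince yourself that the four probabilities always add up to 1:

```python
import numpy as np

# Illustrative parameters for a 4-class softmax regression model
# (the weight and bias values are made up, just for demonstration).
W = np.array([[ 1.0, -2.0],   # w1
              [-1.0,  2.0],   # w2
              [ 0.5,  0.5],   # w3
              [-0.5, -0.5]])  # w4
b = np.array([0.1, -0.1, 0.2, -0.2])  # b1..b4

x = np.array([0.5, 1.5])  # a single input example with 2 features

# z_j = w_j . x + b_j for j = 1..4
z = W @ x + b

# a_j = e^{z_j} / (e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4})
a = np.exp(z) / np.sum(np.exp(z))

print(a)         # estimated probabilities for y = 1, 2, 3, 4
print(a.sum())   # 1.0 -- the four probabilities always add up to 1
```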
You might have realized that because the chances of Y taking on the values 1, 2, 3, or 4 have to add up to 1, A4, the chance of Y being equal to 4, has to be 0.35, which is 1 minus 0.3 minus 0.2 minus 0.15.

So here I wrote down the formulas for softmax regression in the case of 4 possible outputs, and let's now write down the formula for the general case of softmax regression. In the general case, Y can take on N possible values, so Y can be 1, 2, 3, and so on, up to N. In that case, softmax regression will compute Zj equals Wj dot product with X plus Bj, where now the parameters of softmax regression are W1, W2, through Wn, as well as B1, B2, through Bn. And then finally, it will compute Aj equals e to the Zj divided by the sum from k equals 1 to N of e to the Z sub k. Here I'm using another variable k to index the summation, because j refers to a specific fixed number, like j equals 1. Aj is interpreted as the model's estimate that Y is equal to j, given the input features X. And notice that by construction of this formula, if you add up A1, A2, all the way through An, these numbers will always end up adding up to 1. So that specifies how you compute the softmax regression model.

I won't prove it in this video, but it turns out that if you apply softmax regression with N equals 2, so there are only two possible output classes, then softmax regression ends up computing basically the same thing as logistic regression. The parameters end up being a little bit different, but it ends up reducing to a logistic regression model. And that's why the softmax regression model is a generalization of logistic regression.

Having defined how softmax regression computes its outputs, let's now take a look at how to specify the cost function for softmax regression. Recall for logistic regression, this is what we had. We said Z is equal to W dot product with X plus B, and then I wrote earlier that A1 is G of Z, which was interpreted as the probability that Y is equal to 1. And we also wrote A2 as the probability that Y is equal to class 0. So previously we had written the loss of logistic regression as negative Y log A1 minus 1 minus Y log 1 minus A1. But 1 minus A1 is also equal to just A2, because A2 is 1 minus A1 according to this expression over here. So I can rewrite, or simplify, the loss for logistic regression a little bit to be negative Y log A1 minus 1 minus Y log of A2. In other words, the loss if Y is equal to 1 is negative log A1, and if Y is equal to 0, then the loss is negative log A2. And then, same as before, the cost function for all the parameters in the model is the average loss, averaged over the entire training set. So that was the cost function for logistic regression.

Let's write down the cost function that is conventionally used for softmax regression. Recall that these are the equations we use for softmax regression. The loss we're going to use for softmax regression is just this: if the algorithm outputs A1 through AN and the ground truth label is Y, then if Y equals 1, the loss is negative log A1. So it's negative log of the probability that it thought Y was equal to 1. Or if Y is equal to 2, then the loss I'm going to define as negative log A2. So if Y is equal to 2, the loss of the algorithm on this example is negative log of the probability it thought Y was equal to 2. And so on, all the way down to: if Y is equal to N, then the loss is negative log of AN. To illustrate what this is doing, if Y is equal to J, then the loss is negative log of AJ, and that's what this function looks like. Negative log of AJ is a curve that looks like this.
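To make the general-case formulas and this loss concrete, here is a minimal sketch under a few assumptions: the helper names `softmax` and `loss` are mine, not from the lecture, the Z values are made up, and subtracting the maximum of Z before exponentiating is a standard numerical-stability trick the lecture does not cover (it cancels in the ratio, so the result is unchanged):

```python
import numpy as np

def softmax(z):
    """a_j = e^{z_j} / (sum over k of e^{z_k}), for j = 1..N.

    Subtracting max(z) first is a common numerical-stability trick;
    it cancels out in the ratio, so the probabilities are unchanged.
    """
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def loss(a, y):
    """Per-example softmax regression loss: -log(a_y), where y is the
    true class label (1-indexed, as in the lecture)."""
    return -np.log(a[y - 1])

# Example with N = 4 possible classes (illustrative values of z_1..z_4,
# as if they came from z_j = w_j . x + b_j):
z = np.array([2.0, 1.0, 0.5, -1.0])
a = softmax(z)
print(a, a.sum())     # the a_j always add up to 1
print(loss(a, y=2))   # loss on this example if the ground truth label is y = 2
```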
And so if AJ was very close to 1, then you'd be on this part of the curve and the loss would be very small. But if the algorithm thought, say, that AJ was only 0.5, a 50% chance, then the loss gets a little bit bigger. And the smaller AJ is, the bigger the loss. So this incentivizes the algorithm to make AJ as large as possible, as close to 1 as possible, because whatever the actual value of Y was, you want the algorithm to say that the chance of Y being that value was pretty large.

Notice that in this loss function, Y in each training example can take on only one value. So you end up computing this negative log of AJ for only one value of AJ, namely the AJ where J is the actual value of Y in that particular training example. For example, if Y was equal to 2, you end up computing negative log of A2, but not negative log of A1 or any of the other terms here.

So that's the form of the model, as well as the cost function, for softmax regression. If you were to train this model, you could start to build multiclass classification algorithms. What we'd like to do next is take this softmax regression model and fit it into a neural network, so that you're able to do something even better, which is to train a neural network for multiclass classification. Let's go do that in the next video.
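As a quick recap of the loss behavior described above, here is a tiny numeric sketch. The output probabilities and the label y = 2 are made-up values, just to show that only the true class's term is used, and that negative log of AJ grows as AJ shrinks:

```python
import numpy as np

# Suppose the model's outputs for one example are (illustrative numbers):
a = np.array([0.30, 0.20, 0.15, 0.35])  # a1, a2, a3, a4

y = 2
loss = -np.log(a[y - 1])  # only -log(a2) is used; a1, a3, a4 do not appear
print(loss)               # about 1.61 -- a2 = 0.20 is far from 1, so the loss is large

# The closer the predicted probability of the true class is to 1, the smaller the loss:
for p in [0.99, 0.71, 0.50, 0.20, 0.05]:
    print(f"-log({p:.2f}) = {-np.log(p):.3f}")
```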