So far, the classification examples we've talked about have used binary classification, where you had two possible labels, 0 or 1: is it a cat, or is it not a cat? What if you have multiple possible classes? There's a generalization of logistic regression called softmax regression that lets you make predictions where you're trying to recognize one of C classes, rather than just two. Let's take a look.

Let's say that instead of just recognizing cats, you want to recognize cats, dogs, and baby chicks. I'm going to call cats class 1, dogs class 2, and baby chicks class 3. And if it's none of the above, then there's an "other", or "none of the above", class, which I'm going to call class 0. So here's an example of the images and the classes they belong to: that's a picture of a baby chick, so the class is 3; cats are class 1; a dog is class 2; that, I guess, is a koala, so it's none of the above, class 0; then class 3 again, and so on.

The notation we're going to use is capital C to denote the number of classes you're trying to categorize your inputs into. In this case, you have four possible classes, including the "other", or "none of the above", class. When you have four classes, the numbers indexing your classes will be 0 through capital C minus 1; in other words, 0, 1, 2, or 3.

In this case, we're going to build a neural network whose output layer has four, or more generally C, output units. So n^[L], the number of units in the output layer, layer L, is going to be equal to 4, or more generally equal to C. And what we want is for the units in the output layer to tell us the probability of each of these four classes. The first node here is supposed to output, or we want it to output, the probability of the "other" class given the input x. The next will output the probability that it's a cat given x, the next the probability that it's a dog given x, and the last the probability that it's a baby chick, which I'm going to abbreviate BC, given x. So the output y hat is going to be a 4 by 1 vector, because it now has to output four numbers giving you these four probabilities. And because probabilities should sum to 1, the four numbers in the output y hat should sum to 1.

The standard model for getting a neural network to do this uses what's called a softmax layer in the output layer in order to generate these outputs. Let me write down the math, and then we'll come back and get some intuition about what the softmax layer is doing. In the final layer of the neural network, you are going to compute, as usual, the linear part of the layer: z^[L] = W^[L] a^[L-1] + b^[L], where z^[L] is the z variable for the final layer, layer L. Having computed z^[L], you now need to apply what's called the softmax activation function. This activation function is a bit unusual for the softmax layer, but this is what it does. First, we're going to compute a temporary variable t = e^(z^[L]), applied element-wise. In our example, z^[L] is 4 by 1, a four-dimensional vector, so t = e^(z^[L]) is an element-wise exponentiation.
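To make the forward computation concrete, here's a minimal NumPy sketch of this final layer. The parameter values and the 3-unit previous layer are made-up assumptions for illustration; only the shapes and the two steps (linear part, then softmax) follow the lecture.

```python
import numpy as np

def softmax(z):
    """Softmax activation: exponentiate element-wise, then normalize to sum to 1."""
    t = np.exp(z)             # temporary variable t = e^(z[L]), element-wise
    return t / np.sum(t)      # a[L] = t / sum_j t_j

# Hypothetical shapes: C = 4 output classes, 3 units in the previous layer.
rng = np.random.default_rng(0)
W_L = rng.standard_normal((4, 3))
b_L = rng.standard_normal((4, 1))
a_prev = rng.standard_normal((3, 1))

z_L = W_L @ a_prev + b_L      # linear part: z[L] = W[L] a[L-1] + b[L]
a_L = softmax(z_L)            # 4 by 1 vector of class probabilities
print(a_L.sum())              # prints 1.0 (up to floating-point rounding)
```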
t will also be a 4 by 1 vector. Then the output a^[L] is going to be basically the vector t, but normalized to sum to 1. So a^[L] = t / (sum from j = 1 through 4 of t_j), because we have four classes. In other words, a^[L] is also a 4 by 1 vector, and the i-th element of this four-dimensional vector, a^[L]_i, is going to be equal to t_i over the sum of the t_j's.

In case this math isn't clear, let's go through a specific example that will make it clearer. Let's say that you've computed z^[L], and it's the four-dimensional vector (5, 2, -1, 3). We use element-wise exponentiation to compute the vector t, so t is (e^5, e^2, e^(-1), e^3). If you plug that into your calculator, these are the values you get: e^5 is 148.4, e^2 is about 7.4, e^(-1) is 0.4, and e^3 is 20.1. The way we go from the vector t to the vector a^[L] is to just normalize these entries to sum to 1. If you add up the elements of t, you get 176.3. So a^[L] is just the vector t divided by 176.3.

For example, the first node here will output e^5 / 176.3, which turns out to be 0.842. So for this image, if this is the value of z you get, the chance of it being class 0 is 84.2%. The next node outputs e^2 / 176.3, which turns out to be 0.042, so a 4.2% chance. The next one is e^(-1) / 176.3, which is 0.002, and the final one is e^3 / 176.3, which is 0.114, so about an 11.4% chance that this is class 3, which is the baby chick class. So there's a chance of it being class 0, class 1, class 2, or class 3, and the output of the neural network a^[L], which is also y hat, is a 4 by 1 vector whose elements are the four numbers we just computed. So this algorithm takes the vector z^[L] and maps it to four probabilities that sum to 1.

If we summarize what we just did to map from z^[L] to a^[L], computing the exponentiation to get the temporary variable t and then normalizing, we can wrap this up into a softmax activation function and say a^[L] = g^[L](z^[L]). The unusual thing about this particular activation function is that g takes as input a 4 by 1 vector and outputs a 4 by 1 vector. Previously, our activation functions took a single real-valued input; for example, the sigmoid and ReLU activation functions input a real number and output a real number. The unusual thing about the softmax activation function is that, because it needs to normalize across the different possible outputs, it takes in a vector of inputs and outputs a vector.

So what are the things that a softmax classifier can represent? I'm going to show you some examples where you have inputs x1 and x2 that feed directly into a softmax layer with 3, 4, or more output nodes that then outputs y hat. So this is a neural network with no hidden layer: all it does is compute z^[1] = W^[1] x + b^[1], and then the output a^[1], or y hat, is just the softmax activation function applied to z^[1].
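If you want to verify these numbers yourself, this short snippet reproduces the worked example from the lecture:

```python
import numpy as np

z = np.array([5.0, 2.0, -1.0, 3.0])   # the example z[L] from the lecture

t = np.exp(z)             # element-wise exponentiation
a = t / t.sum()           # normalize so the entries sum to 1

print(np.round(t, 1))     # [148.4   7.4   0.4  20.1]
print(round(t.sum(), 1))  # 176.3
print(np.round(a, 3))     # [0.842 0.042 0.002 0.114]
```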
So this neural network with no hidden layer should give you a sense of the types of things a softmax classifier can represent. Here's one example: with just the raw inputs x1 and x2, a softmax layer with C = 3 output classes can represent this type of decision boundary. Notice that these are several linear decision boundaries, but together they allow the classifier to separate the data into 3 classes. In this diagram, what we did was take the training set shown in the figure, train a softmax classifier with 3 output labels on the data, and then color the plot by thresholding the outputs of the softmax classifier and coloring each input according to which of the 3 outputs had the highest probability. So you can see that this is like a generalization of logistic regression, with linear decision boundaries, but with more than 2 classes, where instead of the classes being just 0 or 1, they can be 0, 1, or 2.

Here's another example of a decision boundary that a softmax classifier can represent when trained on a dataset with 3 classes, and here's another one. One intuition is that the decision boundary between any 2 classes will be linear. That's why you see, for example, that the decision boundary between the yellow and red classes is linear, that there's a linear boundary between purple and red, and another linear decision boundary between purple and yellow. The classifier is able to use these different linear functions to separate the space into 3 classes.

Let's look at some examples with more classes. Here's an example with C = 4 classes, so now there's a green class, and softmax continues to represent these types of linear decision boundaries between multiple classes. Here's one more example with C = 5 classes, and one last example with C = 6. So this shows the types of things a softmax classifier can do when there is no hidden layer. Of course, with a much deeper neural network, with x, then some hidden units, then more hidden units, and so on, you can learn even more complex, nonlinear decision boundaries to separate out multiple different classes.

So I hope this gives you a sense of what a softmax layer, or the softmax activation function in a neural network, can do. In the next video, let's take a look at how you can train a neural network that uses a softmax layer.
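Since training is the subject of the next video, here's just a prediction-time sketch of this no-hidden-layer classifier. The weight values below are hypothetical, made up to produce three linear regions, but the structure matches what was described: a linear score per class, softmax, then pick the highest-probability class. Because each score is linear in x, the boundary between any two classes is a line.

```python
import numpy as np

# No-hidden-layer softmax classifier on 2 raw inputs (x1, x2) with C = 3 classes.
# These weights are made up for illustration; in practice they are learned.
W = np.array([[ 1.0,  0.0],
              [-0.5,  1.0],
              [-0.5, -1.0]])     # shape (3, 2): one row of weights per class
b = np.zeros((3, 1))

def predict(X):
    """X has shape (2, m): one column per example. Returns a class per column."""
    Z = W @ X + b                # linear scores z[1] = W[1] x + b[1], shape (3, m)
    T = np.exp(Z)
    A = T / T.sum(axis=0)        # softmax over the 3 classes, per column
    return A.argmax(axis=0)      # class with the highest probability

# Classify a small grid of points; each region of constant label is bounded
# by straight lines, since the scores being compared are linear in x.
xs = np.linspace(-2.0, 2.0, 5)
grid = np.array([[x1, x2] for x1 in xs for x2 in xs]).T   # shape (2, 25)
print(predict(grid))
```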