In the last video, you learned about the softmax layer and the softmax activation function. In this video, you'll deepen your understanding of softmax classification and also learn how to train a model that uses a softmax layer.

Recall our earlier example where the output layer computes z^[L] as follows. If you have four classes, C = 4, then z^[L] is a 4 by 1 dimensional vector, and we said we compute t, a temporary variable that performs element-wise exponentiation. Then, if the activation function for your output layer, g^[L], is the softmax activation function, the output is obtained by taking that temporary variable t and normalizing it to sum to 1. This then becomes a^[L]. Notice that in the z vector the biggest element was 5, and the biggest probability ends up being the first probability.

The name softmax comes from contrasting it with what's called a hard max, which would take the vector z and map it to a vector with a 1 in the position of the biggest element of z and 0s everywhere else. So that's a very hard max, where the biggest element gets an output of 1 and everything else gets an output of 0, whereas the softmax is a more gentle mapping from z to these probabilities. I'm not sure it's a great name, but that's the intuition behind why it's called softmax: it's in contrast to the hard max.

One thing I didn't really show, but alluded to, is that softmax regression, or the softmax activation function, generalizes the logistic activation function to C classes rather than just two classes. It turns out that if C = 2, softmax essentially reduces to logistic regression. I'm not going to prove this in this video, but the rough outline of the proof is that if C = 2 and you apply softmax, the output layer a^[L] will output two numbers, say 0.842 and 0.158. These two numbers always have to sum to 1, so they're redundant: you don't need to bother computing both of them, only one. And it turns out that the way you end up computing that one number reduces to the way logistic regression computes its single output. That wasn't much of a proof, but the takeaway is that softmax regression is a generalization of logistic regression to more than two classes.

Now, let's look at how you would actually train a neural network with a softmax output layer. In particular, let's define the loss function you use to train the network. Let's say you have an example in your training set where the target output, the ground truth label, is 0, 1, 0, 0. Following the example from the previous video, this means it's an image of a cat, because it falls into class 1. Now, let's say that your neural network is currently outputting y-hat, a vector of probabilities that sums to 1: 0.3, 0.2, 0.1, 0.4. You can check that this sums to 1, and this is a^[L]. The neural network is not doing very well on this example, because the image is actually a cat and the network assigned only a 20% chance to it being a cat. So what loss function would you want to use to train this neural network?
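Before answering that, here is a small NumPy sketch of the softmax and hard max computations described above. This is my own illustration, not code from the lecture; the function names are mine, and apart from the biggest entry being 5, the values in z_L are made up.

```python
import numpy as np

def softmax(z):
    """Softmax: exponentiate element-wise, then normalize so the entries sum to 1."""
    t = np.exp(z - np.max(z))   # subtracting the max doesn't change the result but avoids overflow
    return t / np.sum(t)

def hardmax(z):
    """Hard max: 1 in the position of the biggest element of z, 0s everywhere else."""
    a = np.zeros_like(z, dtype=float)
    a[np.argmax(z)] = 1.0
    return a

z_L = np.array([5.0, 2.0, -1.0, 3.0])   # illustrative values; the lecture only fixes the biggest entry at 5
print(softmax(z_L))    # gentle mapping: the biggest probability lands on the first entry
print(hardmax(z_L))    # [1. 0. 0. 0.]
```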
In softmax classification, the loss we typically use is the negative sum from j = 1 through 4 (really the sum from 1 to C in the general case; I'll just use 4 here) of y_j log y-hat_j. Let's look at our single example above to better understand what happens. Notice that in this example y_1 = y_3 = y_4 = 0 and only y_2 = 1. So if you look at this summation, all the terms with zero values of y_j equal 0, and the only term you're left with is -y_2 log y-hat_2, because when you sum over the indices j, every term ends up 0 except when j = 2. And because y_2 = 1, this is just -log y-hat_2.

What this means is that if your learning algorithm is trying to make this loss small, because you use gradient descent to try to reduce the loss on your training set, the only way to make it small is to make -log y-hat_2 small, and the only way to do that is to make y-hat_2 as big as possible. These are probabilities, so it can never be bigger than 1, but this makes sense: if x for this example is a picture of a cat, you want that output probability to be as big as possible. More generally, what this loss function does is look at whatever the ground truth class is in your training set and try to make the corresponding probability of that class as high as possible. If you're familiar with maximum likelihood estimation in statistics, this turns out to be a form of maximum likelihood estimation; but if you don't know what that means, don't worry about it, the intuition we just talked about will suffice.

Now, this is the loss on a single training example. How about the cost J on the entire training set? The cost J of your parameters, all the weights and biases of the network, is defined as pretty much what you'd guess: the sum, over your entire training set, of the loss on your learning algorithm's predictions on your training examples. And what you do is use gradient descent to try to minimize this cost.

Finally, one more implementation detail. Notice that because C = 4, y is a 4 by 1 vector and y-hat is also a 4 by 1 vector. So if you're using a vectorized implementation, the matrix capital Y is y^(1), y^(2), through y^(m), stacked horizontally. For example, if the example up here is your first training example, then the first column of this matrix Y will be 0, 1, 0, 0; then maybe the second example is a dog, maybe the third example is none of the above, and so on. This matrix capital Y ends up being a 4 by m dimensional matrix. Similarly, Y-hat is y-hat^(1) through y-hat^(m), stacked horizontally. So if y-hat^(1) is the output on the first training example, its column will be 0.3, 0.2, 0.1, 0.4, and Y-hat itself will also be a 4 by m dimensional matrix.

Finally, let's take a look at how you'd implement gradient descent when you have a softmax output layer. This output layer computes z^[L], which is C by 1, in our example 4 by 1, and then you apply the softmax activation function to get a^[L], or y-hat, and that in turn allows you to compute the loss. So we've talked about how to implement the forward propagation step of the neural network to get this output and compute that loss. How about the backpropagation step, or gradient descent?
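As a quick aside before we get to backprop, here is a small NumPy sketch of the loss on one example and the vectorized cost just described. The y and y-hat values follow the cat example above; the function names, the small epsilon inside the log, and the 1/m averaging convention are my own additions.

```python
import numpy as np

def loss(y, y_hat, eps=1e-12):
    """L(y_hat, y) = -sum_j y_j * log(y_hat_j) for a single example."""
    return -np.sum(y * np.log(y_hat + eps))   # eps avoids log(0) for zero probabilities

y     = np.array([0, 1, 0, 0])             # ground truth: class 1 (cat), so y_2 = 1
y_hat = np.array([0.3, 0.2, 0.1, 0.4])     # network output, sums to 1
print(loss(y, y_hat))                       # = -log(0.2) ~= 1.61

def cost(Y, Y_hat, eps=1e-12):
    """Cost over the training set: Y and Y_hat are C-by-m matrices, one column per example.
    Averaged over the m examples (the usual convention)."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat + eps)) / m
```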
It turns out that the key step, or the key equation you need to initialize backprop, is this expression: the derivative with respect to z at the last layer, dz^[L], turns out to be y-hat, the 4 by 1 vector, minus y, the 4 by 1 vector. Notice that all of these are 4 by 1 vectors when you have 4 classes, and C by 1 in the more general case. Going by our usual definition of dz, this is the partial derivative of the cost function with respect to z^[L]. If you're an expert in calculus, you can try to derive this yourself, but just using this formula will also work fine if you ever need to implement this from scratch. With this, you can compute dz^[L] and then start off the backprop process to compute all the derivatives you need throughout your neural network.

But it turns out that in this week's programming exercise, we'll start to use one of the deep learning programming frameworks, and with those frameworks you usually just need to focus on getting forward prop right: as long as you specify the forward prop path, the framework will figure out how to do backprop, the backward path, for you. So this expression is worth keeping in mind in case you ever need to implement softmax regression or softmax classification from scratch, although you won't actually need it in this week's programming exercise, because the framework you use will take care of this derivative computation for you.

So that's it for softmax classification. With it, you can now implement learning algorithms to categorize inputs into not just one of two classes, but one of C different classes. Next, I want to show you some of the deep learning programming frameworks, which can make you much more efficient at implementing deep learning algorithms. Let's go on to the next video to discuss that.
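For reference, here is a minimal NumPy sketch of that key backprop equation, dZ^[L] = Y-hat - Y, vectorized over m examples. The variable names and the random example values are my own; only the final line corresponds to the equation from the video.

```python
import numpy as np

def softmax_columns(Z):
    """Column-wise softmax for a C-by-m matrix Z."""
    T = np.exp(Z - np.max(Z, axis=0, keepdims=True))   # element-wise exponentiation, stabilized
    return T / np.sum(T, axis=0, keepdims=True)        # normalize each column to sum to 1

rng = np.random.default_rng(0)
Z_L = rng.standard_normal((4, 5))                # 4 classes, 5 examples (made-up values)
Y   = np.eye(4)[:, rng.integers(0, 4, size=5)]   # one-hot ground-truth labels, C by m
A_L = softmax_columns(Z_L)                       # forward prop at the output layer, i.e. Y-hat

dZ_L = A_L - Y   # key backprop step: dZ^[L] = Y-hat - Y, shape C by m
```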