When you build your neural network, one of the choices you get to make is which activation functions to use in the hidden layers, as well as at the output units of your neural network. So far we've just been using the sigmoid activation function, but sometimes other choices can work much better. Let's take a look at some of the options.

In the forward propagation steps for the neural network, we had two steps where we used the sigmoid function. That sigmoid is called an activation function, and the familiar formula is a = 1 / (1 + e^(-z)). In the more general case, we can use a different function g(z), where g can be a nonlinear function that is not necessarily the sigmoid.

For example, the sigmoid function goes between 0 and 1. An activation function that almost always works better than the sigmoid is the tanh, or hyperbolic tangent, function: a = tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z)), which goes between -1 and +1. Mathematically, tanh is a shifted and rescaled version of the sigmoid: it crosses the origin at (0, 0) and its output ranges from -1 to +1 instead of 0 to 1.

It turns out that for hidden units, letting g(z) = tanh(z) almost always works better than the sigmoid, because with values between -1 and +1, the mean of the activations coming out of your hidden layer is closer to 0. Just as you might center your training data so it has zero mean before training a learning algorithm, using tanh instead of sigmoid has the effect of centering your activations, so their mean is closer to 0 rather than around 0.5. This actually makes learning for the next layer a little bit easier. We'll say more about this in the second course, when we talk about optimization algorithms.

One takeaway is that I pretty much never use the sigmoid activation function anymore; the tanh function is almost always strictly superior. The one exception is the output layer: if y is either 0 or 1, then it makes sense for the prediction y-hat to be a number between 0 and 1, rather than between -1 and 1. So the one place I would still use the sigmoid activation function is the output layer when you're doing binary classification, in which case g(z^[2]) = sigma(z^[2]). In that example, you'd have a tanh activation function for the hidden layer and a sigmoid for the output layer.

So the activation functions can be different for different layers, and to denote that, we sometimes use the square-bracket superscripts as well: g^[1] may be different from g^[2], where superscript [1] refers to the hidden layer and superscript [2] refers to the output layer.

Now, one downside of both the sigmoid function and the tanh function is that if z is either very large or very small, then the gradient, or derivative, or slope of the function becomes very small.
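As a concrete illustration of the setup just described, a tanh hidden layer feeding a sigmoid output unit, here is a minimal NumPy sketch of the forward propagation; the layer sizes, initialization scale, and random inputs are made up purely for illustration.

```python
import numpy as np

def sigmoid(z):
    # sigmoid: a = 1 / (1 + e^(-z)), squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    # Forward propagation for a 2-layer network: tanh in the hidden layer,
    # sigmoid at the output unit (binary classification).
    Z1 = W1 @ X + b1   # hidden-layer pre-activation
    A1 = np.tanh(Z1)   # g[1] = tanh, so hidden activations are roughly zero-centered
    Z2 = W2 @ A1 + b2  # output-layer pre-activation
    A2 = sigmoid(Z2)   # g[2] = sigmoid, so y-hat lies between 0 and 1
    return A2

# Tiny made-up example: 3 input features, 4 hidden units, 1 output, 5 examples.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))
W1, b1 = 0.01 * rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = 0.01 * rng.standard_normal((1, 4)), np.zeros((1, 1))
print(forward(X, W1, b1, W2, b2))   # predictions, each between 0 and 1
```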
So if z is very large or very small, the slope of the function ends up being close to 0, and that can slow down gradient descent. One other choice that is very popular in machine learning is what's called the rectified linear unit, or ReLU. The ReLU is flat at 0 for negative z and then rises linearly, and the formula is a = max(0, z). So the derivative is 1 as long as z is positive, and the derivative, or slope, is 0 when z is negative. If you're implementing this, technically the derivative when z is exactly 0 is not well defined. But when you implement this on a computer, the odds of z being exactly 0.000000... are so small that you don't need to worry about it. In practice, you can pretend the derivative at z = 0 is either 1 or 0, and your code will work just fine, even though, strictly speaking, the function is not differentiable there.

So here are some rules of thumb for choosing activation functions. If your output is a 0/1 value, that is, you're doing binary classification, then the sigmoid activation function is a very natural choice for the output layer. For all other units, the ReLU, or rectified linear unit, is increasingly the default choice of activation function. So if you're not sure what to use for your hidden layers, I would just use the ReLU activation function. It's what you see most people using these days, although sometimes people also use the tanh activation function.

One disadvantage of the ReLU is that the derivative is 0 when z is negative. In practice this works just fine, but there is another version, called the leaky ReLU, that instead of being 0 when z is negative takes a slight slope (we'll give you the formula on the next slide). The leaky ReLU usually works a bit better than the ReLU activation function, although it's just not used as much in practice. Either one should be fine, although if I had to pick one, I usually just use the ReLU.

The advantage of both the ReLU and the leaky ReLU is that, for much of the space of z, the slope of the activation function is far from 0. So in practice, with the ReLU activation function your neural network will often learn much faster than with tanh or sigmoid, and the main reason is that there's less of this effect of the slope going to 0 and slowing down learning. I know that for half of the range of z the slope of the ReLU is 0, but in practice enough of your hidden units will have z greater than 0, so learning can still be quite fast for most training examples.

So let's quickly recap the pros and cons of the different activation functions. The sigmoid: I would say never use this, except for the output layer if you're doing binary classification, or maybe almost never use this, because the tanh is pretty much strictly superior. Then there's the tanh activation function. And then the default, most commonly used activation function is the ReLU: if you're not sure what else to use, use this one. And feel free also to try the leaky ReLU, a = max(0.01z, z), which gives you the slight bend in the function for negative z. You might ask why that constant is 0.01; well, you can also make it another parameter of the learning algorithm.
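Here is a similarly minimal NumPy sketch of the ReLU and leaky ReLU, along with the derivative convention at z = 0 discussed above; the test values and the 0.01 slope are just the illustrative defaults from this video.

```python
import numpy as np

def relu(z):
    # ReLU: a = max(0, z)
    return np.maximum(0.0, z)

def relu_derivative(z):
    # Slope is 1 for z > 0 and 0 for z < 0. At z == 0 (which essentially
    # never occurs in floating point) we just use 0, as discussed above.
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # Leaky ReLU: a = max(alpha * z, z), a small slope alpha for negative z
    return np.maximum(alpha * z, z)

def leaky_relu_derivative(z, alpha=0.01):
    # Slope is 1 for z > 0 and alpha otherwise (including at z == 0)
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))                   # [ 0.    0.    0.   0.5  2. ]
print(relu_derivative(z))        # [ 0.    0.    0.   1.   1. ]
print(leaky_relu(z))             # [-0.02 -0.005 0.   0.5  2. ]
print(leaky_relu_derivative(z))  # [ 0.01  0.01  0.01 1.   1. ]
```

Note that alpha is passed here as an ordinary fixed argument; treating it as a learnable parameter, as just mentioned, would mean including it among the parameters your optimizer updates.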
And some people say that learning that constant works even better, but I hardly ever see people do that. If you feel like trying it in your application, please feel free to do so; you can see how well it works, and stick with it if it gives you a good result.

So I hope that gives you a sense of some of the choices of activation functions you can use in your neural network. One of the themes we'll see in deep learning is that you often have a lot of different choices in how you build your neural network, ranging from the number of hidden units, to the choice of activation function, to how you initialize the weights, which you'll see later. And it turns out that it's sometimes difficult to get good guidelines for exactly what will work best for your problem. Throughout these three courses, I'll keep giving you a sense of what I see in the industry in terms of what's more or less popular. But for your application, with its own idiosyncrasies, it's actually very difficult to know in advance exactly what will work best.

So a common piece of advice is: if you're not sure which of these activation functions works best, try them all, evaluate on a holdout validation set, or development set, which we'll talk about later, see which one works better, and go with that (a minimal sketch of this appears after the wrap-up below). By testing these different choices for your application, you'll be better at future-proofing your neural network architecture against the idiosyncrasies of your problem, as well as against evolutions of the algorithms, rather than relying on a blanket rule like "always use a ReLU activation and nothing else," which may or may not apply to whatever problem you end up working on, in the near future or the distant future.

All right. So that was the choice of activation functions, and you've seen the most popular ones. There's one other question you could ask: why do you even need an activation function at all? Why not just do away with it? Let's talk about that in the next video, where you'll see why neural networks do need some sort of nonlinear activation function.
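As a hedged illustration of that try-and-compare advice, here is a minimal sketch using scikit-learn's MLPClassifier, which supports 'logistic' (sigmoid), 'tanh', and 'relu' hidden-layer activations (leaky ReLU is not offered there). The synthetic dataset, hidden-layer size, and iteration count are made-up choices for illustration only; for a real project you would substitute your own data and model code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic binary-classification data, purely for illustration.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.25, random_state=0)   # hold out a development set

results = {}
for act in ["logistic", "tanh", "relu"]:     # candidate hidden-layer activations
    clf = MLPClassifier(hidden_layer_sizes=(16,), activation=act,
                        max_iter=500, random_state=0)
    clf.fit(X_train, y_train)
    results[act] = clf.score(X_dev, y_dev)   # accuracy on the dev set

print(results)   # go with whichever activation scores best for your problem
```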