Let's take a look at the details of what the TensorFlow code for training a neural network is actually doing. Let's dive in. Before looking at the details of training a neural network, let's recall how you trained a logistic regression model in the previous course.
Step 1 of building a logistic regression model was to specify how to compute the output given the input feature x and the parameters w and b. In the first course, we said the logistic regression function predicts f of x equal to g, the sigmoid function, applied to w dot x plus b. So, if z is w dot x plus b, then f of x is 1 over 1 plus e to the negative z. So that was the first step, to specify the input-to-output function of logistic regression, and that depends on both the input x and the parameters of the model.
The second step we had to do to train the logistic regression model was to specify the loss function and also the cost function. You may recall that the loss function said, if logistic regression outputs f of x and the ground truth label, the actual label in the training set, was y, then the loss on that single training example was negative y log f of x minus 1 minus y times log of 1 minus f of x. So this was a measure of how well logistic regression is doing on a single training example x comma y. Given this definition of a loss function, we then defined the cost function, and the cost function was a function of the parameters w and b. That was just the average of the loss function, taken over all m training examples x1, y1 through xm, ym.
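The first two steps recalled above can be written out in a few lines of plain Python. This is only a minimal sketch: the function names and the toy data are hypothetical and chosen for illustration, not taken from the course code.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    """Step 1: f(x) = g(w . x + b) for one example x (a list of features)."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return sigmoid(z)

def loss(f_x, y):
    """Step 2a: loss on one example, -y log f(x) - (1 - y) log(1 - f(x))."""
    return -y * math.log(f_x) - (1 - y) * math.log(1 - f_x)

def cost(X, Y, w, b):
    """Step 2b: cost J(w, b), the average loss over all m training examples."""
    m = len(X)
    return sum(loss(predict(x, w, b), y) for x, y in zip(X, Y)) / m

# Hypothetical toy data: two features per example, binary labels.
X_toy = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]]
Y_toy = [0, 0, 1, 1]

# With all parameters at zero, every prediction is 0.5,
# so the cost equals log(2), roughly 0.693.
J_at_zero = cost(X_toy, Y_toy, w=[0.0, 0.0], b=0.0)
print(J_at_zero)
```

Keeping the loss (one example) and the cost (average over the training set) as separate functions mirrors the convention used in the lecture.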
And remember that in the convention we're using, the loss function is a function of the output of the learning algorithm and the ground truth label, computed on a single training example, whereas the cost function J is an average of the loss function computed over your entire training set. So that was step two of what we did when building up logistic regression.
And then the third and final step to train a logistic regression model was to use an algorithm, specifically gradient descent, to minimize that cost function J of w, b as a function of the parameters w and b. We minimize the cost J using gradient descent, where w is updated as w minus the learning rate alpha times the derivative of J with respect to w, and b similarly is updated as b minus the learning rate alpha times the derivative of J with respect to b.
So with these three steps, step one, specify how to compute the outputs given the input x and parameters, step two, specify the loss and cost, and step three, minimize the cost function, we trained logistic regression. The same three steps are how we can train a neural network in TensorFlow.
Now let's look at how these three steps map to training a neural network. We'll go over this in greater detail on the next three slides, but really briefly: step one, specifying how to compute the output given the input x and parameters w and b, is done with this code snippet, which should be familiar from last week, where we specified the neural network. This was actually enough to specify the computations needed in forward propagation, or for the inference algorithm, for example. The second step is to compile the model and to tell it what loss you want to use. Here's the code that you use to specify this loss function, which is the binary cross-entropy loss function. Once you specify this loss, taking an average over the entire training set also gives you the cost function for the neural network.
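As a reminder of what step three looked like for logistic regression, here is a minimal gradient descent sketch in plain Python. The derivative expressions used (the average of f(x) minus y times x for w, and the average of f(x) minus y for b) are the standard result for this cost; the function names, toy data, learning rate, and iteration count are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_and_gradients(X, Y, w, b):
    """Return J(w, b) and its derivatives with respect to w and b."""
    m, n = len(X), len(w)
    J, dJ_dw, dJ_db = 0.0, [0.0] * n, 0.0
    for x, y in zip(X, Y):
        f_x = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
        J += -y * math.log(f_x) - (1 - y) * math.log(1 - f_x)
        err = f_x - y                      # prediction error on this example
        for j in range(n):
            dJ_dw[j] += err * x[j]
        dJ_db += err
    return J / m, [g / m for g in dJ_dw], dJ_db / m

def gradient_descent(X, Y, w, b, alpha, num_iters):
    """Repeat: w := w - alpha * dJ/dw, b := b - alpha * dJ/db."""
    history = []
    for _ in range(num_iters):
        J, dJ_dw, dJ_db = cost_and_gradients(X, Y, w, b)
        history.append(J)
        w = [w_j - alpha * g_j for w_j, g_j in zip(w, dJ_dw)]
        b = b - alpha * dJ_db
    return w, b, history

# Hypothetical toy data and hyperparameters.
X_toy = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]]
Y_toy = [0, 0, 1, 1]
w_fit, b_fit, history = gradient_descent(
    X_toy, Y_toy, w=[0.0, 0.0], b=0.0, alpha=0.1, num_iters=100
)
print(history[0], history[-1])  # the cost should go down over the iterations
```

The neural network case follows the same update rule, only with many more parameters and with the derivatives computed by the library.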
And then step three is to call a function to try to minimize the cost as a function of the parameters of the neural network. Let's look in greater detail at these three steps in the context of training a neural network.
The first step is to specify how to compute the output given the input x and parameters w and b. This code snippet specifies the entire architecture of the neural network. It tells you that there are 25 hidden units in the first hidden layer, then 15 in the next one, and then one output unit, and that we're using the sigmoid activation function. And so based on this code snippet, we also know what the parameters are: w1, b1 of the first layer, the parameters of the second layer, and the parameters of the third layer. So this code snippet specifies the entire architecture of the neural network and therefore tells TensorFlow everything it needs in order to compute the output a3, or f of x, as a function of the input x and the parameters, here written as w l and b l, the parameters of layer l.
Let's go on to step two. In the second step, you have to specify what the loss function is, and that will also define the cost function we use to train the neural network. For the handwritten digit classification problem, where images are either of a 0 or a 1, the most common loss function by far is this one. It's actually the same loss function as what we had for logistic regression. It's negative y log f of x minus 1 minus y times log of 1 minus f of x, where y is the ground truth label, sometimes also called the target label y, and f of x is now the output of the neural network. In TensorFlow, this is called the binary cross-entropy loss function. Where does that name come from? Well, it turns out in statistics, this function on top is called the cross-entropy loss function. So that's what cross-entropy means. And the word binary just re-emphasizes or points out that this is a binary classification problem, because each image is either a 0 or a 1.
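To make steps one and two concrete, here is a plain-Python sketch of what the architecture implies: a 25, 15, 1 stack of sigmoid layers, the parameters of each layer, and the binary cross-entropy loss applied to the output a3. This is not the TensorFlow code from the slide. The layer sizes come from the description above, while the input size of 4, the random initialization, the example input, and every function name are assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_in, n_units, rng):
    """One dense layer: a weight vector w_j per unit plus one bias b_j per unit."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_units)]
    b = [0.0] * n_units
    return W, b

def dense(a_in, W, b):
    """a_out[j] = g(w_j . a_in + b_j) for every unit j in the layer."""
    return [sigmoid(sum(w * a for w, a in zip(w_j, a_in)) + b_j)
            for w_j, b_j in zip(W, b)]

def forward(x, layers):
    """Forward propagation: feed the activations through each layer in turn."""
    a = x
    for W, b in layers:
        a = dense(a, W, b)
    return a

def binary_cross_entropy(f_x, y):
    """Loss on one example: -y log f(x) - (1 - y) log(1 - f(x))."""
    return -y * math.log(f_x) - (1 - y) * math.log(1 - f_x)

rng = random.Random(0)
n_features = 4                 # hypothetical input size
layer_sizes = [25, 15, 1]      # hidden, hidden, output units

layers, n_in = [], n_features
for n_units in layer_sizes:
    layers.append(init_layer(n_in, n_units, rng))
    n_in = n_units

# Total count of weights and biases across the three layers (531 here).
num_params = sum(len(W) * len(W[0]) + len(b) for W, b in layers)

a3 = forward([0.2, -0.1, 0.4, 0.0], layers)   # output f(x), a list of length 1
example_loss = binary_cross_entropy(a3[0], y=1)
print(num_params, a3[0], example_loss)
```

The point of the sketch is that once the layer sizes and activation are fixed, both the set of parameters and the computation of f(x) are fully determined, which is all the loss function needs.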
And the syntax is to ask TensorFlow to compile the neural network using this loss function. As another historical note, Keras was originally a library that had been developed independently of TensorFlow. It was actually a totally separate project from TensorFlow. But eventually it got merged into TensorFlow, which is why we have tf.keras.losses followed by the name of this loss function. And by the way, I don't always remember the names of all the loss functions in TensorFlow. I just do a quick web search myself to find the right name, and then I plug that into my code.
Having specified the loss with respect to a single training example, TensorFlow knows that the cost you want to minimize is then the average, taken over all m training examples, of the loss on each training example. And optimizing this cost function will result in fitting the neural network to your binary classification data.
In case you want to solve a regression problem rather than a classification problem, you can also tell TensorFlow to compile your model using a different loss function. For example, if you have a regression problem and you want to minimize the squared error loss, here is the squared error loss: if your learning algorithm outputs f of x with a target or ground truth label of y, the loss is one half of the squared error. Then you can use this loss function in TensorFlow, which is the maybe more intuitively named mean squared error loss function, and TensorFlow will try to minimize the mean squared error.
In this expression, I'm using J of capital W comma capital B to denote the cost function. The cost function is a function of all of the parameters in the neural network. So you can think of capital W as including W1, W2, W3, so all the W parameters in the entire neural network, and B as including B1, B2, and B3.
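The relationship between a per-example loss and the cost can be sketched in plain Python for the regression case. The function names and the numbers are hypothetical. One detail worth flagging as an assumption about the library rather than something stated in the lecture: as far as I know, Keras's mean squared error averages the squared differences without the factor of one half, which rescales the cost but does not change which parameters minimize it.

```python
def squared_error_loss(f_x, y):
    """Loss on one example as described in the lecture: half the squared error."""
    return 0.5 * (f_x - y) ** 2

def plain_squared_error(f_x, y):
    """Squared error without the 1/2 factor, for comparison."""
    return (f_x - y) ** 2

def cost(predictions, targets, loss_fn):
    """Cost = average of the per-example loss over all m training examples."""
    m = len(predictions)
    return sum(loss_fn(f_x, y) for f_x, y in zip(predictions, targets)) / m

# Hypothetical model outputs f(x) and ground truth targets y.
predictions = [2.5, 0.0, 2.0]
targets = [3.0, -0.5, 2.0]

J_half = cost(predictions, targets, squared_error_loss)    # 0.25 / 3
J_mse = cost(predictions, targets, plain_squared_error)    # 0.5 / 3
print(J_half, J_mse)
```

Passing the loss in as a function mirrors the idea of compiling a model with a chosen loss: the averaging step stays the same and only the per-example loss changes.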
So if you are optimizing the cost function with respect to W and B, you'd be trying to optimize it with respect to all of the parameters in the neural network. And up on top as well, I have written f of x as the output of the neural network, but if you want, you can also write f with the subscript W, B to emphasize that the output of the neural network as a function of x depends on all the parameters in all the layers of the neural network. So that's the loss function and the cost function.
Finally, you will ask TensorFlow to minimize the cost function. You might remember the gradient descent algorithm from the first course. If you are using gradient descent to train the parameters of a neural network, then you will repeatedly, for every layer l and for every unit j, update the parameter w of layer l and unit j to its current value minus the learning rate alpha times the partial derivative of the cost function J of W, B with respect to that parameter, and similarly for the parameters b as well. And after doing, say, 100 iterations of gradient descent, hopefully you get to a good value of the parameters.
So in order to use gradient descent, the key thing you need to compute is these partial derivative terms. What TensorFlow does, and in fact what is standard in neural network training, is to use an algorithm called backpropagation to compute these partial derivative terms. TensorFlow can do all of these things for you. It implements backpropagation all within this function called fit. So all you have to do is call model.fit with X, y as your training set and tell it to do so for 100 iterations, or 100 epochs. In fact, what you'll see later is that TensorFlow can use an algorithm that is even a little bit faster than gradient descent. You'll see more about that later this week as well. Now, I know that we're relying heavily on the TensorFlow library in order to implement a neural network.
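Putting the three steps together, the TensorFlow code discussed in this video probably looks something like the sketch below. The layer sizes, the sigmoid activation, the binary cross-entropy loss, and the 100 epochs come from the description above. The synthetic dataset, its shape, and the variable names are assumptions made so the example runs on its own, and no optimizer is specified, so Keras uses its default.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy

# Hypothetical synthetic binary classification data: 64 examples, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32").reshape(-1, 1)

# Step 1: specify how to compute the output given the input and parameters.
model = Sequential([
    Dense(units=25, activation="sigmoid"),
    Dense(units=15, activation="sigmoid"),
    Dense(units=1, activation="sigmoid"),
])

# Step 2: specify the loss; the cost is its average over the training set.
model.compile(loss=BinaryCrossentropy())

# Step 3: minimize the cost. fit() computes the derivatives with
# backpropagation and updates the parameters for 100 epochs.
history = model.fit(X, y, epochs=100, verbose=0)

# Inference with the trained model: one probability per example.
predictions = model.predict(X, verbose=0)
print(history.history["loss"][0], history.history["loss"][-1])
```

For a regression problem, the same structure would apply with a different loss passed to compile, for example the mean squared error loss from tf.keras.losses, and typically a different output activation.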
One pattern I've seen across multiple ideas is that, as the technology evolves, libraries become more mature and most engineers will use libraries rather than implement code from scratch. There have been many other examples of this in the history of computing. Once, many, many decades ago, programmers had to implement their own sorting function from scratch. But now, sorting libraries are mature enough that you probably call someone else's sorting function rather than implement it yourself, unless you're taking a computing class that asks you to do it as an exercise. And today, if you want to compute the square root of a number, like, what is the square root of 7? Well, once, programmers had to write their own code to compute this, but now pretty much everyone just calls a library to take square roots or to do matrix operations, such as multiplying two matrices together.
So when deep learning was younger and less mature, many developers, including me, were implementing things from scratch, using Python or C++ or some other language. But today, deep learning libraries have matured enough that most developers will use these libraries. In fact, most commercial implementations of neural networks today use a library like TensorFlow or PyTorch. But as I've mentioned, it's still useful to understand how they work under the hood so that if something unexpected happens, which still does happen with today's libraries, you have a better chance of knowing how to fix it.
Now that you know how to train a basic neural network, also called a multilayer perceptron, there are some things you can change about the neural network that will make it even more powerful. In the next video, let's take a look at how you can swap in different activation functions as an alternative to the sigmoid activation function we've been using. This will make your neural networks work much better. So let's go take a look at that in the next video.