We've seen that high bias and high variance are both bad, in the sense that they hurt the performance of your algorithm. One of the reasons that neural networks have been so successful is that neural networks, together with the idea of big data, or hopefully having large data sets, have given us new ways to address both high bias and high variance. Let's take a look.

You saw that if you're fitting different-order polynomials to a data set, then a simple linear model can have high bias, whereas a complex model might suffer from high variance, and there's a trade-off between the two. In our example, choosing a second-order polynomial helped you make that trade-off and pick the model with the lowest possible cross-validation error. Before the days of neural networks, machine learning engineers talked a lot about this bias-variance trade-off, in which you had to balance the complexity, that is, the degree of the polynomial or the regularization parameter lambda, so that bias and variance were both not too high. If you hear machine learning engineers talk about the bias-variance trade-off, this is what they're referring to: too simple a model gives high bias, too complex a model gives high variance, and you have to find a trade-off between these two bad things to get, hopefully, the best possible outcome.

But it turns out that neural networks offer us a way out of this dilemma of having to trade off bias against variance, with some caveats. Large neural networks, when trained on small-to-moderate-sized datasets, are low-bias machines. What I mean by that is that if you make your neural network large enough, you can almost always fit your training set well, so long as your training set is not enormous. This gives us a new recipe for reducing bias or reducing variance as needed, without having to trade off between the two.

Let me share with you a simple recipe that isn't always applicable, but that, when it applies, can be very powerful for getting an accurate model using a neural network. First, train your algorithm on your training set, and then ask: does it do well on the training set? To answer that, measure J-train and see whether it is high; by high, I mean relative to human-level performance or some baseline level of performance. If the algorithm is not doing well, you have a high-bias problem: high training set error. One way to reduce bias is to use a bigger neural network, meaning either more hidden layers or more hidden units per layer. You can keep going through this loop, making your neural network bigger and bigger, until it does well on the training set, meaning it achieves a level of error on your training set roughly comparable to the target level of error you hope to reach, which could be human-level performance.

Once it does well on the training set, so the answer to that first question is yes, you then ask: does it do well on the cross-validation set? In other words, does it have low variance? If the answer is no, that is, it does well on the training set but not on the cross-validation set, then that big gap between J-CV and J-train indicates you probably have a high-variance problem. If you have a high-variance problem, one way to try to fix it is to get more data. (There's a short code sketch of this whole loop below.)
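To make the decision logic of this recipe concrete, here is a minimal Python sketch. All of the helper names (train, evaluate, make_bigger_network, get_more_data) and both thresholds are hypothetical placeholders, not functions from any real library; the loop just encodes the flowchart described above.

```python
# A minimal sketch of the bias/variance recipe described above.
# train, evaluate, make_bigger_network, get_more_data, baseline_error,
# and gap_tolerance are all hypothetical placeholders.

def develop_model(model, train_set, cv_set, baseline_error, gap_tolerance):
    while True:
        model = train(model, train_set)            # fit on the training set
        j_train = evaluate(model, train_set)       # measure J-train
        if j_train > baseline_error:
            # High bias: not doing well on the training set,
            # so use a bigger network (more layers / more units).
            model = make_bigger_network(model)
            continue
        j_cv = evaluate(model, cv_set)             # measure J-CV
        if j_cv - j_train > gap_tolerance:
            # High variance: big gap between J-CV and J-train,
            # so get more data and go back around the loop.
            train_set = get_more_data(train_set)
            continue
        return model  # does well on both sets; probably done
```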
So you get more data, go back, retrain the model, and double-check: does it do well on the training set? If not, use a bigger network; if it does, check whether it does well on the cross-validation set, and if not, get more data. You can keep going round and round this loop until eventually the model does well on the cross-validation set, and then you're probably done, because you now have a model that does well on the cross-validation set and that hopefully will also generalize to new examples.

Now, of course, there are limitations to this recipe. Training a bigger neural network does reduce bias, but at some point it gets computationally expensive. That's why the rise of neural networks has really been assisted by the rise of very fast computers, especially GPUs, or graphics processing units: hardware traditionally used to speed up computer graphics that has turned out to be very useful for speeding up neural networks as well. But even with hardware accelerators, beyond a certain point the neural networks are so large and take so long to train that it becomes infeasible. The other limitation is data: sometimes you can only get so much, and beyond a certain point it's hard to get much more. But I think this recipe explains a lot of the rise of deep learning in the last several years: for applications where you do have access to a lot of data, being able to train large neural networks lets you eventually get pretty good performance on a lot of applications.

One thing that was implicit in this slide, and may not have been obvious, is that as you're developing a learning algorithm, sometimes you find that you have high bias, in which case you do things like increase the size of the neural network. But after you increase the neural network, you may find that you have high variance, in which case you might do other things, like collect more data. During the hours or days or weeks you spend developing a machine learning algorithm, you may have high bias at some points and high variance at others; it can change. But knowing whether your algorithm has high bias or high variance at a given time helps guide what you should try next.

When you're training a neural network, one thing people have asked me before is: hey, Andrew, what if my neural network is too big? Will that create a high-variance problem? It turns out that a large neural network with well-chosen regularization will usually do as well as or better than a smaller one. For example, if you have a small neural network and you switch to a much larger one, you might think the risk of overfitting goes up significantly. But if you regularize the larger neural network appropriately, it will usually do at least as well as, or better than, the smaller one, so long as the regularization is chosen appropriately. Another way of saying this is that it almost never hurts to go to a larger neural network, so long as you regularize appropriately. There is one caveat: training a larger neural network is more computationally expensive, so the main way it hurts is by slowing down your training and inference.

Very briefly, here is how you regularize a neural network.
If the cost function for your neural network is the average loss, where the loss could be squared error or logistic loss, then the regularization term looks like pretty much what you'd expect: lambda over 2m times the sum of w squared, where the sum is over all weights w in the neural network (the full cost function is written out below). Similar to regularization for linear regression and logistic regression, we usually don't regularize the parameters b in a neural network, although in practice it makes very little difference whether you do so or not.

Here's how you would implement regularization in TensorFlow. Recall that the code for an unregularized handwritten digit classification model creates three layers, each with a number of hidden units and an activation, and then combines them into a sequential model. To add regularization, you just add the extra argument kernel_regularizer=L2(0.01) to each layer, where 0.01 is the value of lambda (see the code sketch below). TensorFlow actually lets you choose different values of lambda for different layers, although for simplicity you can choose the same value of lambda for all the weights in all of the layers. This lets you implement regularization in your neural network.

To summarize, here are the two takeaways I hope you have from this video. First, it hardly ever hurts to have a larger neural network, so long as you regularize appropriately; the one caveat is that a larger neural network can slow down your algorithm, so maybe that's the one way it hurts, but it shouldn't hurt your algorithm's performance for the most part, and it could even help significantly. Second, so long as your training set isn't too large, a neural network, especially a large one, is often a low-bias machine: it fits very complicated functions very well. That's why, when I'm training neural networks, I find that I'm often fighting variance problems rather than bias problems, at least if the neural network is large enough. So the rise of deep learning has really changed the way machine learning practitioners think about bias and variance. Having said that, even when you're training a neural network, measuring bias and variance and using that to guide what you do next is often a very helpful thing to do.

So that's it for bias and variance. Let's go on to the next video, where we'll take all the ideas we've learned and see how they fit into the development process of machine learning systems. I hope that will tie a lot of these pieces together and give you practical advice on how to move quickly forward in the development of your machine learning systems.
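For reference, here is the regularized cost function described above, written out. The notation (J as the cost over parameters W and B, m training examples, loss L, prediction f) is assumed to follow the conventions used earlier in this course; the key point is only the lambda-over-2m penalty on the weights.

$$
J(\mathbf{W}, \mathbf{B}) \;=\; \frac{1}{m}\sum_{i=1}^{m} L\!\left(f_{\mathbf{W},\mathbf{B}}\big(\mathbf{x}^{(i)}\big),\, y^{(i)}\right) \;+\; \frac{\lambda}{2m}\sum_{\text{all weights } w} w^{2}
$$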
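And here is a minimal sketch of the regularized TensorFlow model described above. The kernel_regularizer=L2(0.01) argument and the three-layer Sequential structure come straight from the lecture; the specific layer sizes and activations are assumptions, since the transcript doesn't spell them out.

```python
# A sketch of the regularized three-layer model described above.
# Layer sizes and activations are assumptions; lambda = 0.01 is the
# value from the lecture.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import L2

model = Sequential([
    # kernel_regularizer=L2(0.01) adds an L2 penalty on this layer's
    # weights to the cost (Keras computes 0.01 * sum(w**2));
    # the biases b are left unregularized.
    Dense(units=25, activation="relu", kernel_regularizer=L2(0.01)),
    Dense(units=15, activation="relu", kernel_regularizer=L2(0.01)),
    Dense(units=1,  activation="sigmoid", kernel_regularizer=L2(0.01)),
])
```

You would then compile and fit the model as usual; the regularization penalty is simply added to whatever loss you compile with. To use a different lambda per layer, pass a different L2 value to each Dense layer.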