Welcome to this course on the practical aspects of deep learning. By now, you've learned how to implement a neural network. This week, you'll learn the practical aspects of how to make your neural network work well, ranging from things like hyperparameter tuning, to how to set up your data, to how to make sure your optimization algorithm runs quickly, so that your learning algorithm learns in a reasonable amount of time. In this first week, we'll first talk about how to set up your machine learning problem, then we'll talk about regularization, and then we'll talk about some tricks for making sure your neural network implementation is correct. With that, let's get started.

Making good choices in how you set up your training, development, and test sets can make a huge difference in helping you quickly find a good, high-performance neural network. When training a neural network, you have to make a lot of decisions, such as how many layers your neural network will have, how many hidden units you want each layer to have, what the learning rate is, and what activation functions you want to use for the different layers. When you're starting on a new application, it's almost impossible to correctly guess the right values for all of these, and for other hyperparameter choices, on your first attempt. So in practice, applying machine learning is a highly iterative process: you often start with an idea, such as building a neural network with a certain number of layers and a certain number of hidden units, maybe on certain data sets, and then you just have to code it up and try it. By running your code, you run an experiment and get back a result that tells you how well this particular network or this particular configuration works. Based on the outcome, you might then refine your ideas, change your choices, and keep iterating in order to try to find a better and better neural network.

Today, deep learning has found great success in a lot of areas, ranging from natural language processing, to computer vision, to speech recognition, to a lot of applications on structured data. Structured data includes everything from advertising, to web search (which isn't just internet search engines; it's also, for example, shopping websites, or really any website that wants to deliver great search results when you enter terms into a search bar), to computer security, to logistics, such as figuring out where to send drivers to pick up and drop off things, and many more. What I've seen is that sometimes a researcher with a lot of experience in NLP might try to do something in computer vision, or a researcher with a lot of experience in speech recognition might jump in and try to do something in advertising, or someone from security might want to jump in and do something in logistics. And intuitions from one domain or application area often do not transfer to other application areas. The best choices may depend on the amount of data you have, the number of input features you have, your computer configuration, whether you're training on GPUs or CPUs and exactly what configuration of GPUs and CPUs, and many other things. So for a lot of applications, even very experienced deep learning people find it almost impossible to correctly guess the best choice of hyperparameters the very first time.
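To make the kinds of choices involved concrete, here is a minimal sketch in Python of what one experiment's hyperparameter configuration might look like. The names and values here are purely hypothetical starting points, not recommendations from the course:

```python
# A hypothetical set of hyperparameter choices for one experiment.
# None of these values are "right" in advance; you pick a starting
# point, train the model, look at the results, and iterate.
config = {
    "num_layers": 3,               # how many hidden layers
    "hidden_units": [64, 32, 16],  # units in each hidden layer
    "learning_rate": 0.01,         # step size for gradient descent
    "activation": "relu",          # activation for the hidden layers
}

# The iterative loop: Idea -> Code -> Experiment -> refine the idea.
# In practice you would train a model with `config`, measure its
# dev-set performance, then adjust these values and try again.
```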
And so today, applied deep learning is a very iterative process, where you just have to go around this cycle many times to hopefully find a good choice of network for your application. One of the things that determines how quickly you can make progress is how efficiently you can go around this cycle, and setting up your data sets well in terms of your train, development, and test sets can make you much more efficient at that.

So if this is your data, let's draw it as a big box: traditionally, you might take all the data you have and carve off some portion of it to be your training set, and some portion to be your holdout cross-validation set, which is sometimes also called the development set. For brevity, I'm just going to call this the dev set, but all of these terms mean roughly the same thing. Then you might carve out some final portion to be your test set. The workflow is that you keep training algorithms on your training set and use your dev set, or holdout cross-validation set, to see which of many different models performs best. Then, after having done this long enough, when you have a final model that you want to evaluate, you take the best model you have found and evaluate it on your test set, in order to get an unbiased estimate of how well your algorithm is doing.

In the previous era of machine learning, it was common practice to take all your data and split it 70/30 (the "70-30 train/test split" people often talk about, if you don't have an explicit dev set), or 60/20/20: 60 percent train, 20 percent dev, and 20 percent test. Several years ago, this was widely considered best practice in machine learning. If you have maybe 100 examples in total, maybe 1,000 examples, maybe up to 10,000 examples, these sorts of ratios are perfectly reasonable rules of thumb.

But in the modern big data era, where you might have, say, a million examples in total, the trend is for the dev and test sets to become a much smaller percentage of the total. Remember, the goal of the dev set, the development set, is that you're going to test different algorithms on it and see which algorithm works better. So the dev set just needs to be big enough for you to evaluate, say, two different algorithm choices, or ten different algorithm choices, and quickly decide which one is doing better, and you might not need a whole 20 percent of your data for that. For example, if you have a million training examples, you might decide that just 10,000 examples in your dev set is more than enough to evaluate which of two algorithms does better. In a similar vein, the main goal of your test set is to give you a pretty confident estimate of how well your final classifier is doing, and again, if you have a million examples, 10,000 examples may be more than enough to evaluate a single classifier. So in this example, if you need just 10,000 for your dev set and 10,000 for your test set, then since 10,000 is 1 percent of 1 million, you'd have 98 percent train, 1 percent dev, 1 percent test.
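Here is a minimal NumPy sketch of such a split. The function name, arguments, and the 98/1/1 ratios are illustrative assumptions, not fixed requirements; for a small data set you would simply pass larger fractions:

```python
import numpy as np

def train_dev_test_split(X, y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle the data, then carve off small dev and test sets.

    With one million examples, dev_frac = test_frac = 0.01 gives the
    98% / 1% / 1% split discussed above; for a smaller data set you
    might instead use 0.2 / 0.2 (the traditional 60/20/20 split).
    """
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)  # shuffle so each split is representative

    n_dev = int(m * dev_frac)
    n_test = int(m * test_frac)
    dev_idx = idx[:n_dev]
    test_idx = idx[n_dev:n_dev + n_test]
    train_idx = idx[n_dev + n_test:]

    return ((X[train_idx], y[train_idx]),
            (X[dev_idx], y[dev_idx]),
            (X[test_idx], y[test_idx]))

# Example with synthetic data: 1,000,000 examples, split 98/1/1.
X = np.random.randn(1_000_000, 5)
y = np.random.randint(0, 2, size=1_000_000)
train, dev, test = train_dev_test_split(X, y)
print(train[0].shape, dev[0].shape, test[0].shape)
# (980000, 5) (10000, 5) (10000, 5)
```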
And I've also seen applications where, if you have even more than a million examples, you might end up with 99.5 percent train, 0.25 percent dev, and 0.25 percent test, or maybe 0.4 percent dev and 0.1 percent test. So just to recap: when setting up your machine learning problem, you'll often split your data into train, dev, and test sets. If you have a relatively small data set, the traditional ratios might be okay, but if you have a much larger data set, it's also fine to set your dev and test sets to be much smaller than 20 percent, or even 10 percent, of your data. We'll give more specific guidelines on the sizes of dev and test sets later in this specialization.

One other trend we're seeing in the era of modern deep learning is that more and more people train on mismatched train and test distributions. Let's say you're building an app that lets users upload a lot of pictures, and your goal is to find pictures of cats in order to show your users; maybe all your users are cat lovers. Your training set might come from cat pictures downloaded off the Internet, but your dev and test sets might comprise cat pictures from users of your app. So maybe your training set has a lot of pictures crawled off the Internet, while the dev and test sets are pictures uploaded by users. It turns out a lot of web pages have very high-resolution, very professional, very nicely framed pictures of cats, but your users may be uploading blurrier, lower-resolution images taken with a cell phone camera in more casual conditions. So these two distributions of data may be different. The rule of thumb I'd encourage you to follow in this case is to make sure that the dev and test sets come from the same distribution. We'll say more about this particular guideline later, but because you will be using the dev set to evaluate a lot of different models, and trying really hard to improve performance on the dev set, it's nice if your dev set comes from the same distribution as your test set. But because deep learning algorithms have such a huge hunger for training data, one trend I'm seeing is that you might use all sorts of creative tactics, such as crawling web pages, to acquire a much bigger training set than you would otherwise have, even if part of the cost is that your training data might not come from the same distribution as your dev and test sets. You'll find that so long as you follow this rule of thumb, progress in your machine learning algorithm will be faster, and I'll give a more detailed explanation for this particular rule of thumb later in this specialization as well.

Finally, it might be okay to not have a test set. Remember, the goal of the test set is to give you an unbiased estimate of the performance of your final network, the network that you selected. But if you don't need that unbiased estimate, then it might be okay to not have a test set. So if you have only a dev set but not a test set, what you do is train on the training set, try different model architectures, evaluate them on the dev set, and use that to iterate and try to get to a good model. Because you've fit your model to the dev set, this no longer gives you an unbiased estimate of performance, but if you don't need one, that might be perfectly fine. In the machine learning world, when you have just a train and a dev set but no separate test set, most people will call the training set a training set, and they will call the dev set the test set.
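As a sketch of the same-distribution rule, here is one hypothetical way to assemble the splits when you have a large web-crawled pool and a smaller pool of app uploads. The pool sizes and file names are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data pools, represented here as placeholder file paths.
web_crawled = [f"web_{i}.jpg" for i in range(200_000)]  # high-res, professional
app_uploads = [f"app_{i}.jpg" for i in range(10_000)]   # blurrier phone photos

# All web-crawled data goes into training; the training distribution
# is allowed to differ from what the app will see in production.
train_set = list(web_crawled)

# Dev and test both come only from app uploads, so they share one
# distribution: the distribution you actually care about doing well on.
app = rng.permutation(app_uploads)
dev_set, test_set = app[:5_000], app[5_000:]

# Leftover app uploads could also be added to training if you had more;
# the key constraint is only that dev and test match each other.
print(len(train_set), len(dev_set), len(test_set))  # 200000 5000 5000
```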
But what they actually end up doing is using the test set as a holdout cross-validation set, which maybe isn't a completely great use of terminology, because they're then overfitting to the test set. So when a team tells you that they have only a train and a test set, I would just be cautious and ask: do they really have a train and dev set, since they're overfitting to the test set? Culturally, it might be difficult to change some of these teams' terminology and get them to call it a train/dev split rather than a train/test split, even though I think calling it a train and development set would be more correct terminology. And this is actually okay practice if you don't need a completely unbiased estimate of the performance of your algorithm.

So, having set up a train, dev, and test set will allow you to iterate more quickly. It will also allow you to more efficiently measure the bias and variance of your algorithm, so you can more efficiently select ways to improve it. Let's start to talk about that in the next video.