We've all been hearing that deep neural networks work really well for a lot of problems. It's not just that they need to be big neural networks; specifically, they need to be deep, that is, to have a lot of hidden layers. So why is that? Let's go through a couple of examples and try to gain some intuition for why deep networks might work well.

So first, what is a deep network computing? If you're building a system for face recognition or face detection, here's what a deep neural network could be doing. Perhaps you input a picture of a face. You can think of the first layer of the neural network as a feature detector or an edge detector. In this example, I'm plotting what a neural network with maybe 20 hidden units might be trying to compute on this image, with the 20 hidden units visualized by little square boxes. So, for example, one visualization represents a hidden unit that's trying to figure out where the edges of a particular orientation are in the image, and maybe another hidden unit is trying to figure out where the horizontal edges in the image are. When we talk about convolutional networks in a later course, this particular visualization will make a bit more sense. But informally, you can think of the first layer of the neural network as looking at a picture and trying to figure out where the edges in that picture are.

Now that it's figured out where the edges are, by grouping together pixels to form edges, it can then take the detected edges and group them together to form parts of faces. So, for example, you might have one neuron trying to see if it's finding an eye, or a different neuron trying to find a part of the nose. By putting together lots of edges, it can start to detect different parts of faces. And then finally, by putting together different parts of faces, like an eye, or a nose, or an ear, or a chin, it can try to recognize or detect different types of faces.

So intuitively, you can think of the earlier layers of a neural network as detecting simpler functions, like edges, and then composing them together in the later layers so that it can learn more and more complex functions. These visualizations will make more sense when we talk about convolutional nets. One technical detail of this visualization: the edge detectors are looking at relatively small areas of the image, maybe very small regions, whereas the facial feature detectors can look at much larger areas of the image. But the main intuition to take away from this is that the network finds simpler things, like edges, and then builds them up, composing them together, to detect more complex things like an eye or a nose, and then composes those together to find even more complex things.

This type of simple-to-complex hierarchical representation, or compositional representation, applies to other types of data than images and face recognition as well. For example, if you're trying to build a speech recognition system, it's hard to visualize speech, but if you input an audio clip, then the first layer of a neural network might learn to detect low-level audio waveform features, such as: is this tone going up or going down, is it white noise or a sibilant sound, and what is the pitch? It can detect low-level waveform features like that. And then by composing low-level waveform features, maybe it will learn to detect basic units of sound.
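As a rough illustration of what "deep" means here, this is a minimal numpy sketch (my own, not from the course) of forward propagation through a stack of hidden layers. The layer sizes and random weights are made up; the idea that early activations behave like edge detectors and later ones like face-part detectors is an intuition, not something this code enforces.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Made-up layer sizes: a flattened image in, a few hidden layers,
# and one output unit for "face" vs. "not face".
layer_sizes = [4096, 128, 64, 32, 1]          # illustrative only
rng = np.random.default_rng(0)
params = [(rng.standard_normal((n_out, n_in)) * 0.01, np.zeros((n_out, 1)))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, params):
    """Pass the input through each layer in turn.
    Intuitively, earlier activations play the role of simple features
    and later activations the role of more complex, composed features."""
    a = x
    for l, (W, b) in enumerate(params):
        z = W @ a + b
        # ReLU in the hidden layers, sigmoid at the output.
        a = relu(z) if l < len(params) - 1 else 1 / (1 + np.exp(-z))
    return a

x = rng.standard_normal((4096, 1))            # a fake "image" as a column vector
print(forward(x, params).shape)               # (1, 1)
```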
In linguistics, these are called phonemes. For example, in the word "cat", the "c" sound is a phoneme, the "a" is a phoneme, and the "t" is another phoneme. So the network learns to find the basic units of sound. Then, composing those together, maybe it learns to recognize words in the audio, and then maybe it can compose those together to recognize entire phrases or sentences. So a deep neural network with multiple hidden layers might be able to have the earlier layers learn these low-level, simpler features and then have the later, deeper layers put together the simpler things it's detected in order to detect more complex things, like specific words or even phrases or sentences being uttered, in order to carry out speech recognition. And what we see is that whereas the earlier layers compute what seem like relatively simple functions of the input, such as where the edges are, by the time you get deep into the network you can actually do surprisingly complex things, such as detect faces or detect words or phrases or sentences.

Some people like to make an analogy between deep neural networks and the human brain, where we believe, or neuroscientists believe, that the human brain also starts off detecting simple things, like edges in what your eyes see, and then builds those up to detect more complex things, like the faces that you see. I think analogies between deep learning and the human brain are sometimes a little bit dangerous, but there is a lot of truth to this being how we think the human brain works: the human brain probably detects simple things like edges first, and then puts them together to form more and more complex objects. And so that has served as a loose form of inspiration for some of deep learning as well. We'll say a bit more about the human brain, or the biological brain, in a later video this week.

The other piece of intuition about why deep networks seem to work well is the following. This result comes from circuit theory, which pertains to what types of functions you can compute with different logic gates: AND gates, OR gates, and NOT gates. Informally, there are functions you can compute with a relatively small but deep neural network, where by small I mean the number of hidden units is relatively small, that a shallow network, one without enough hidden layers, would need exponentially more hidden units to compute.

Let me give you one example and illustrate this a bit informally. Let's say you're trying to compute the exclusive OR, or the parity, of all your input features: x1 XOR x2 XOR x3 XOR ... XOR xn, if you have n, or n_x, features. You can build an XOR tree: first compute the XOR of x1 and x2, then take x3 and x4 and compute their XOR, and so on. Technically, if you're just using AND, OR, and NOT gates, you might need a couple of layers to compute a single XOR rather than just one layer, but with a relatively small circuit you can compute each XOR. You then keep building up an XOR tree like this until eventually you have a circuit that outputs y-hat, which equals y, the exclusive OR, or parity, of all of these input bits. To compute the XOR this way, the depth of the network will be on the order of log n for this type of XOR tree.
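To make the XOR-tree argument concrete, here is a small sketch (my own illustration, not code from the course) that computes the parity of n bits by repeatedly XOR-ing pairs. Each pass over the list is one layer of the tree, so the depth grows like log2(n) while the total number of XOR gates is only n - 1.

```python
import math

def parity_by_xor_tree(bits):
    """Compute x1 XOR x2 XOR ... XOR xn with a tree of pairwise XORs.
    Each pass over `level` corresponds to one layer of the tree."""
    level = list(bits)
    depth = 0
    gates = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] ^ level[i + 1])   # one XOR gate
            gates += 1
        if len(level) % 2 == 1:                   # odd element carried up unchanged
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth, gates

bits = [1, 0, 1, 1, 0, 1, 0, 1]                   # n = 8, illustrative input
p, depth, gates = parity_by_xor_tree(bits)
print(p, depth, gates)                            # parity = 1, depth = 3, gates = 7
print(depth == math.ceil(math.log2(len(bits))))   # depth is about log2(n)
```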
So the number of nodes, or the number of circuit components or gates, in this network is not very large; you don't need that many gates to compute the exclusive OR. But now, if you're not allowed to use a neural network with multiple hidden layers, in this case on the order of log n hidden layers, and you're forced to compute this function with just one hidden layer, so all the inputs go into a single set of hidden units which then output y, then in order to compute the parity of x, this XOR function, that hidden layer will need to be exponentially large. Essentially, you need to exhaustively enumerate on the order of 2^n possible configurations of the input bits that result in the exclusive OR being either one or zero. So you end up needing a hidden layer that is exponentially large in the number of bits. I think technically you could do this with 2^(n-1) hidden units, but that's still on the order of 2^n, so it's going to be exponentially large in the number of bits.

So I hope this gives a sense that there are mathematical functions that are much easier to compute with deep networks than with shallow networks. I have to admit, I personally find this result from circuit theory less useful for gaining intuition, but it is one of the results that people often cite when explaining the value of very deep representations.

Now, in addition to these reasons for preferring deep networks, to be perfectly honest I think the other reason the term deep learning has taken off is just branding. These things used to be called neural networks with a lot of hidden layers, but the phrase "deep learning" is just a great brand; it's so deep, right? So I think that once that term caught on, this rebranding of neural networks with many hidden layers helped to capture the popular imagination as well.

Deep networks do work well, but sometimes people go overboard and insist on using tons of hidden layers. When I'm starting on a new problem, I'll often start out with even logistic regression, then try something with one or two hidden layers, and treat the number of hidden layers as a hyperparameter that I tune in order to find the right depth for the neural network. Over the last several years, though, there has been a trend toward people finding that for some applications, very deep neural networks, sometimes with many dozens of layers, can be the best model for a problem.

So that's it for the intuitions about why deep learning seems to work well. Let's now take a look at the mechanics of how to implement not just forward propagation, but also back propagation.
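As a sketch of the depth-tuning workflow described above (using scikit-learn's MLPClassifier purely for illustration; the toy dataset and the candidate depths are made up), you might compare a few small networks of increasing depth on a validation set and keep whichever generalizes best:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# A made-up toy dataset just to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
y = (X[:, :5].sum(axis=1) > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Treat the number of hidden layers as a hyperparameter: 1, 2, then 3 hidden layers.
candidate_depths = [(16,), (16, 16), (16, 16, 16)]
for hidden in candidate_depths:
    model = MLPClassifier(hidden_layer_sizes=hidden, max_iter=1000, random_state=0)
    model.fit(X_train, y_train)
    print(hidden, round(model.score(X_val, y_val), 3))   # validation accuracy per depth
```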