One of the most exciting recent developments in deep learning has been the rise of end-to-end deep learning. So, what is end-to-end deep learning? Briefly, there are data processing systems, or learning systems, that require multiple stages of processing, and what end-to-end deep learning does is take all those multiple stages and replace them, usually, with just a single neural network.

Let's look at some examples. Take speech recognition, where your goal is to take an input x, such as an audio clip, and map it to an output y, which is a transcript of that audio clip. Traditionally, speech recognition required many stages of processing. First, you would extract some hand-designed features of the audio clip. If you've heard of MFCC, that's an algorithm for extracting a certain set of hand-designed features for audio. Then, having extracted some low-level features, you might apply a machine learning algorithm to find the phonemes in the audio clip. Phonemes are the basic units of sound; for example, the word "cat" is made up of three sounds: "k", "a", and "t". You extract those, then string the phonemes together to form individual words, and then string those together to form the transcript of the audio clip. In contrast to this pipeline with a lot of stages, what end-to-end deep learning does is let you train a huge neural network to input the audio clip and directly output the transcript.

One interesting sociological effect in AI is that as end-to-end deep learning started to work better, there were some researchers who had, for example, spent many years of their careers designing individual steps of the pipeline. There were researchers in different disciplines, not just speech recognition, but also computer vision and other areas, who had spent a lot of time, written multiple papers, and maybe even built large parts of their careers engineering features or other pieces of the pipeline. When end-to-end deep learning just took a large training set and learned the function mapping from x to y directly, bypassing a lot of those intermediate steps, it was challenging for some disciplines to come around to accepting this alternative way of building AI systems, because it really obsoleted, in some cases, many years of research on some of the intermediate components.

It turns out that one of the challenges of end-to-end deep learning is that you might need a lot of data before it works well. For example, if you're training on 3,000 hours of data to build a speech recognition system, then the full traditional pipeline works really well. It's only when you have a very large data set, anywhere from, say, 10,000 hours of data up to maybe 100,000 hours of data, that the end-to-end approach suddenly starts to work really well. So, when you have a smaller data set, the more traditional pipeline approach works just as well, and often works even better; you need a large data set before the end-to-end approach really shines. And if you have a medium amount of data, there are also intermediate approaches, where maybe you input the audio, bypass the hand-designed features, and just have the neural network output the phonemes, and then keep some of the other stages. This would be a step toward end-to-end learning, but not all the way there.
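To make the contrast concrete, here is a toy PyTorch sketch of what "one network from audio to transcript" could look like. This is purely illustrative: every layer size is a made-up placeholder, and a real end-to-end recognizer (a CTC-style model, for instance) would be far larger and trained on huge amounts of paired audio and transcripts. The point is only the shape of the idea: raw waveform in, per-frame character scores out, with no hand-designed MFCC or phoneme stages in between.

```python
import torch
import torch.nn as nn

class EndToEndASR(nn.Module):
    """Toy end-to-end model: raw audio waveform -> character logits.
    All sizes below are invented placeholders for illustration."""
    def __init__(self, n_chars=29):  # e.g. 26 letters + space + apostrophe + blank
        super().__init__()
        # Learned front-end replaces hand-designed features like MFCC
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=11, stride=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=11, stride=2), nn.ReLU(),
        )
        # Recurrent layer replaces the explicit phoneme/word stages
        self.rnn = nn.GRU(128, 256, batch_first=True, bidirectional=True)
        self.head = nn.Linear(512, n_chars)  # per-frame character scores

    def forward(self, waveform):             # waveform: (batch, 1, samples)
        x = self.encoder(waveform)           # (batch, 128, frames)
        x, _ = self.rnn(x.transpose(1, 2))   # (batch, frames, 512)
        return self.head(x)                  # (batch, frames, n_chars)

model = EndToEndASR()
logits = model(torch.randn(2, 1, 16000))     # two one-second clips at 16 kHz
print(logits.shape)                          # torch.Size([2, 3993, 29])
```

Everything the traditional pipeline did in separate, hand-engineered stages is folded into one trainable function here, which is exactly why the approach needs so much labeled audio before it outperforms the pipeline.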
So, this is a picture of a face recognition turnstile built by a researcher, Yuanqing Lin at Baidu, where this is a camera that looks at the person approaching the gate, and if it recognizes the person, the turnstile automatically lets them through. So, rather than needing to swipe an RFID badge to enter the facility, in increasingly many offices in China, and hopefully more and more in other countries as well, you can just approach the turnstile, and if it recognizes your face, it lets you through without you needing to carry a badge.

So, how do you build a system like this? Well, one thing you could do is just look at the image the camera is capturing. I guess this is my bad drawing, but maybe this is a camera image with someone approaching the turnstile, so this might be the image x that your camera captures. One thing you could do is try to learn a function mapping directly from the image x to the identity of the person y. It turns out this is not the best approach. One of the problems is that the person could approach the turnstile from lots of different directions: they could be in this green position or in this blue position, and sometimes they're farther from the camera, so their face appears smaller in the image, and sometimes they're closer, so their face appears much bigger.

So, what is actually done to build these turnstiles is not to just take the raw image and feed it to a neural net to try to figure out the person's identity. Instead, the best approach today seems to be a multi-step approach. First, you run one piece of software, a detector, to figure out where the person's face is. Having detected the face, you then zoom in on that part of the image and crop it so that the face is centered. Then this picture, which I guess I drew here in red, is fed to a neural network to estimate the person's identity. What researchers have found is that instead of trying to learn everything in one step, breaking the problem down into two simpler steps, first figure out where the face is, and second look at the face and figure out who it actually is, allows the learning algorithm, or really two learning algorithms, to solve two much simpler tasks, and results in overall better performance.

By the way, if you want to know how step two here actually works, I've simplified the description a bit. The way the second step is actually trained is that you train a neural network that takes as input two images and tells you whether the two images are of the same person or not. So, if you have, say, 10,000 employee IDs on file, you can take this image in red and quickly compare it against all 10,000 photos on file, to figure out whether this picture in red is indeed one of your 10,000 employees whom you should allow into the facility, if this is a turnstile giving employees access to a workplace.

So, why is it that the two-step approach works better? There are actually two reasons for that.
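As a rough illustration of that structure (not of any production system), here is what the two steps might look like glued together. The `detector` and `verifier` below are hypothetical stand-ins for a trained face detector and a trained two-image verification network; only the control flow is the point.

```python
# Sketch of the two-step turnstile logic. `detector` and `verifier` are
# hypothetical trained models passed in as callables, not a real library API.

def crop_and_center(image, box):
    """Crop the detected face region; box = (top, bottom, left, right)."""
    top, bottom, left, right = box
    return image[top:bottom, left:right]

def recognize(camera_image, employee_photos, detector, verifier, threshold=0.7):
    # Step 1: find the face, so step 2 always sees a centered, similarly sized face
    face = crop_and_center(camera_image, detector(camera_image))

    # Step 2: compare the cropped face against every employee photo on file.
    # The verifier takes TWO images and scores whether they show the same person.
    for employee_id, photo in employee_photos.items():
        if verifier(face, photo) > threshold:
            return employee_id        # recognized: open the turnstile
    return None                       # no match: keep the gate closed
```

Notice how the two-image formulation sidesteps training a 10,000-way classifier from a handful of photos per employee: the verifier can be trained on vast numbers of same-person and different-person pairs, which is exactly the data-availability argument that follows.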
One is that each of the two problems you're solving is actually much simpler; the second is that you have a lot of data for each of the two subtasks. In particular, there is a lot of data you can obtain for face detection, task one over here, where the task is to look at an image and figure out where the person's face is. There is a lot of labeled data (x, y), where x is a picture and y is the position of the person's face, so you can build a neural network that does task one quite well. And then, secondly, there is a lot of data for task two as well. Today, leading face recognition teams have, let's say, hundreds of millions of pictures of people's faces, so given a closely cropped image like this red one, they have at least hundreds of millions of images they can use to train a network to look at two images and figure out whether they show the same person or not. So there is also a lot of data for task two. In contrast, if you were to try to learn everything at the same time, there is much less data of the form (x, y) where x is an image taken from a turnstile and y is the identity of the person. So, because you don't have enough data to solve this end-to-end learning problem, but you do have enough data to solve sub-problems one and two, in practice breaking this down into two sub-problems results in better performance than a pure end-to-end deep learning approach. Although, if you had enough data for the end-to-end approach, maybe the end-to-end approach would work better, that's not what works best in practice today.

Let's look at a few more examples. Take machine translation. Traditionally, machine translation systems also had a long, complicated pipeline, where you would first take, say, English text, then do text analysis, basically extracting a bunch of features from the text, and so on, and after many, many steps, output a translation of the English text into French. Because for machine translation you do have a lot of pairs of English and French sentences, end-to-end deep learning works quite well: today it is possible to get a large data set of (x, y) pairs where x is the English sentence and y is the corresponding French translation. So, in this example, end-to-end deep learning works well.

One last example. Let's say you want to look at an x-ray picture of a child's hand and estimate the age of the child. You know, when I first heard about this problem, I thought it was a very cool crime scene investigation task, where you find, maybe tragically, the skeleton of a child and want to figure out how old the child was. It turns out the typical application of this problem, estimating the age of a child from a hand x-ray, is less dramatic than the crime scene investigation I was picturing: pediatricians use it to estimate whether a child is growing and developing normally. A non-end-to-end approach to this would be to look at the image and segment out, or recognize, the bones; so, just try to figure out where this bone segment is, and that bone segment, and that bone segment, and so on. Then, knowing the lengths of the different bones, you can go to a look-up table showing the average bone lengths in a child's hand and use that to estimate the child's age.
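As a toy illustration of just the look-up-table half of that pipeline: the table values below are entirely invented for the example, and real pediatric bone-age reading uses standardized reference atlases, with step one (segmenting the bones out of the x-ray) being its own model.

```python
# Step 2 of the non-end-to-end bone-age pipeline, as a toy sketch.
# The reference values are made up for illustration only.

AVG_BONE_LENGTH_MM = {4: 28.0, 6: 34.0, 8: 40.0, 10: 46.0, 12: 52.0}  # age -> mean length

def estimate_age_years(bone_lengths_mm):
    """Map measured bone lengths to the reference age with the closest mean length."""
    mean_length = sum(bone_lengths_mm) / len(bone_lengths_mm)
    return min(AVG_BONE_LENGTH_MM,
               key=lambda age: abs(AVG_BONE_LENGTH_MM[age] - mean_length))

# Step 1 (not shown) would produce these measurements from the x-ray image.
print(estimate_age_years([39.0, 41.5, 40.2]))  # -> 8 under this made-up table
```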
And this approach actually works pretty well. In contrast, if you were to go straight from the image to the child's age, you would need a lot of data to do that directly, and as far as I know, that approach does not work as well today, just because there isn't enough data to train this task in an end-to-end fashion. Whereas, in contrast, you can imagine breaking this problem down into two steps: step one is a relatively simple problem, so maybe you don't need that much data, maybe you don't need that many x-ray images, to segment out the bones; and for step two, by collecting statistics on a number of children's hands, you can get decent estimates of average bone lengths without too much data. So this multi-step approach seems promising, maybe more promising than the end-to-end approach, at least until you can get more data for the end-to-end learning approach.

So, when end-to-end deep learning works, it can work really well and can really simplify the system, not requiring you to build so many hand-designed individual components. But it's also not a panacea; it doesn't always work. In the next video, I want to share with you a more systematic description of when you should, and maybe when you shouldn't, use end-to-end deep learning, and how to piece together these complex machine learning systems.