One of the most exciting developments with sequence-to-sequence models has been the rise of very accurate speech recognition. We're nearing the end of the course, so I want to take just a couple of videos to give you a sense of how these sequence-to-sequence models are applied to audio data, such as speech. So, what is the speech recognition problem? You're given an audio clip X, and your job is to automatically find a text transcript Y. An audio clip, if you plot it, looks like this: the horizontal axis is time, and what a microphone really measures is minuscule changes in air pressure. The way you're hearing my voice right now is that your ear is detecting little changes in air pressure, probably generated either by your speakers or by a headset. So an audio clip like this basically plots air pressure against time. And if this audio clip is of me saying "the quick brown fox," then hopefully a speech recognition algorithm can input that audio clip and output that transcript.
Because even the human ear doesn't process raw waveforms, but instead has physical structures that measure the intensity of different frequencies, a common pre-processing step for audio data is to take your raw audio clip and generate a spectrogram. This is a plot where the horizontal axis is time, the vertical axis is frequency, and the intensity of the different colors shows the amount of energy, that is, how loud the sound is at different frequencies at different times. These spectrograms, or what you might also hear people call filter bank outputs, are a commonly applied pre-processing step before audio is passed into a learning algorithm. And the human ear does a computation pretty similar to this pre-processing step.
One of the most exciting trends in speech recognition is that, once upon a time, speech recognition systems used to be built using phonemes. These were, I want to say, hand-engineered basic units of sound. So "the quick brown fox" would be represented as phonemes. I'm going to simplify a bit, but you'd say "the" has a "duh" and an "eh" sound, and "quick" has a "kuh", "wuh", "ih", "kuh" sound. Linguists used to write out these basic units of sound and try to break language down into them. These aren't the official phonemes, which are written with more complicated notation, but linguists used to hypothesize that writing down audio in terms of these basic units of sound called phonemes would be the best way to do speech recognition. With end-to-end deep learning, we're finding that phoneme representations are no longer necessary. Instead, you can build systems that input an audio clip and directly output a transcript, without needing hand-engineered representations like these. One of the things that made this possible was going to much larger datasets. Academic datasets on speech recognition might be as small as 300 hours, and in academia a 3,000-hour dataset of transcribed audio would be considered a reasonable size, so a lot of research papers have been written on datasets that are several thousand hours. But the best commercial systems are now trained on over 10,000 hours and sometimes over 100,000 hours of audio. It's really this move to much larger transcribed audio datasets, with both X and Y, together with deep learning algorithms, that has driven a lot of progress in speech recognition.
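Coming back to the spectrogram pre-processing step mentioned above, here is a minimal sketch of how you might compute one in Python using scipy. The file name "speech.wav" and the window and hop sizes are placeholder choices, not values from the course; real systems often use filter bank or MFCC features instead.

```python
# A minimal sketch of the spectrogram pre-processing step.
# "speech.wav" and the 25 ms / 10 ms window settings are placeholder choices.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, audio = wavfile.read("speech.wav")   # fs = sampling rate in Hz, audio = raw waveform
if audio.ndim > 1:                       # mix stereo down to mono if needed
    audio = audio.mean(axis=1)

# Energy per (frequency, time) cell; a 25 ms window with a 10 ms hop
# is a common choice for speech features.
nperseg = int(0.025 * fs)
noverlap = nperseg - int(0.010 * fs)
freqs, times, Sxx = spectrogram(audio, fs=fs, nperseg=nperseg, noverlap=noverlap)

# Log-compress the energies; something like this (or filter bank outputs)
# is what gets fed into the learning algorithm instead of the raw waveform.
log_spec = np.log(Sxx + 1e-10)
print(log_spec.shape)  # (num_frequency_bins, num_time_frames)
```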
So, how do you build a speech recognition system? In the last video, we talked about the attention model. One thing you could do is actually use that, where on the horizontal axis you take in different time frames of the audio input, and then you have an attention model try to output the transcript, like "the quick brown fox," or whatever was said. One other method that seems to work well is to use the CTC cost for speech recognition. CTC stands for Connectionist Temporal Classification and is due to Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. So, here's the idea. Let's say the audio clip is of someone saying "the quick brown fox." We're going to use a neural network structured like this, with an equal number of input Xs and output Ys. I've drawn a simple unidirectional, forward-only RNN for this, but in practice this will usually be a bidirectional LSTM or bidirectional GRU, and usually a deeper model. Notice that the number of time steps here is very large. In speech recognition, the number of input time steps is usually much bigger than the number of output time steps. For example, if you have 10 seconds of audio and your features come at 100 hertz, so 100 samples per second, then a 10-second audio clip ends up with 100 hertz times 10 seconds, which is 1,000 inputs. But your output might not have 1,000 characters. So what do you do?
The CTC cost function allows the RNN to generate an output like this: t, t, t, then a special character called the blank character, which I'm going to write as an underscore, then h, blank, e, e, e, blank, blank, blank, then maybe a space, which I'll write like this, and then blank, blank, blank, q, q, q, blank, blank. This is considered a correct output for the first part of "the quick," up through the q. The basic rule for the CTC cost function is to collapse repeated characters not separated by blank. To be clear, I'm using this underscore to denote the special blank character, which is different from the space character. There is a space here between "the" and "quick," so it should output a space. By collapsing repeated characters not separated by blank, this sequence collapses into t, h, e, then space, then q. This allows the neural network to have a thousand outputs, by repeating characters a lot of times or inserting a bunch of blank characters, and still end up with a much shorter output text transcript. The transcript "the quick brown fox," including spaces, actually has 19 characters, and if the neural network is forced to output a thousand characters, then by allowing it to insert blanks and repeated characters, it can still represent this 19-character output with these 1,000 output values of Y.
So this paper by Alex Graves, as well as Baidu's Deep Speech system, which I was involved in, used this idea to build effective speech recognition systems. I hope that gives you a rough sense of how speech recognition models work: attention models and CTC models present two different options for how to go about building these systems. Now, today, building an effective, production-scale speech recognition system is a pretty significant effort and requires a very large dataset.
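To make the collapsing rule concrete, here is a small sketch of it in Python. This only illustrates the collapse step ("collapse repeated characters not separated by blank, then remove blanks"); the actual CTC cost sums over all alignments that collapse to the target transcript, which is not shown here. The underscore as the blank symbol and the function name are just notational choices for this sketch.

```python
# A minimal sketch of the CTC collapsing rule: collapse repeated characters
# not separated by blank, then remove the blanks. "_" is used as the blank
# symbol here purely as notation, matching the lecture.
BLANK = "_"

def ctc_collapse(output_chars):
    """Collapse a per-time-step output sequence into the final transcript."""
    collapsed = []
    prev = None
    for ch in output_chars:
        if ch != prev:            # keep a character only when it changes
            collapsed.append(ch)
        prev = ch
    return "".join(c for c in collapsed if c != BLANK)  # then drop blanks

# For example, a long per-time-step output beginning like this...
rnn_output = list("ttt_h_eee___") + [" "] + list("___qqq__")
print(ctc_collapse(rnn_output))   # -> "the q"
```

This is how a 1,000-step output can still represent a 19-character transcript: the network is free to pad with blanks and repeats, and the collapse removes them.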
But what I'd like to do in the next video is share with you how you can build a trigger word detection system, or keyword detection system, which is actually much easier and can be done with even a smaller, more reasonable amount of data. So let's talk about that in the next video.