Hi Yann, you've been such a leader in deep learning for so long; thanks a lot for doing this with us. Well, thanks for having me. So you've been working on neural nets for a long time. I'd love to hear your personal story: how did you get started in AI, and how did you end up working with neural networks? I was always interested in intelligence in general, in the emergence of intelligence in humans, and that got me interested in human evolution when I was a kid. This was in France; I was in middle school or something, and I was interested in technology, space, and so on. My favorite movie was 2001: A Space Odyssey, which had intelligent machines, space travel, and human evolution as its themes; that's what I was fascinated by. So the concept of an intelligent machine really appealed to me. Then I studied electrical engineering, and when I was maybe in my second year of engineering school, I stumbled on a book, actually a philosophy book, which was a debate between Noam Chomsky, the computational linguist at MIT, and Jean Piaget, the cognitive psychologist who studied child development in Switzerland. It was basically a debate between nature and nurture, with Chomsky arguing that language has a lot of innate structure and Piaget saying that a lot of it is learned. Each of them brought a team of people to argue for their side, and on Piaget's side was Seymour Papert from MIT, who had worked on the perceptron model, one of the first machines capable of learning. I had never heard of the perceptron, and I read this article and thought, a machine capable of learning, that sounds wonderful. So I started going to several university libraries and searching for everything I could find about the perceptron, and I realized there were a lot of papers from the '50s, but it kind of stopped at the end of the '60s, with a book that was co-authored by the same Seymour Papert. What year was this? This was 1980, roughly. So I did a couple of projects with some of the math professors at my school on neural nets, essentially, but there was no one I could talk to who had worked on this, because the field had basically disappeared in the meantime; in 1980, nobody was working on it. I experimented with it a little bit, writing simulations of various kinds and reading about neuroscience. When I finished my engineering studies, I studied chip design, VLSI design, which was something completely different. And when I finished, I really wanted to do research on this, and I had already figured out that the important question at the time was how you train neural nets with multiple layers. It was pretty clear in the literature of the '60s that this was the important question that had been left unsolved. And as for the idea of hierarchy, I'd read Fukushima's article on the Neocognitron, which was this sort of hierarchical architecture very similar to what we now call convolutional nets, but without the backprop-style learning algorithms.
And I met people from a small independent lab in France who were interested in what they called, at the time, automata networks. They gave me a couple of papers on Hopfield networks, which are not very popular anymore but were the first associative memories built with neural nets. That work revived the interest of some research communities in neural nets in the early '80s, mostly among physicists, condensed-matter physicists, and a few psychologists; it was still not okay for engineers and computer scientists to talk about neural nets. They also showed me another paper that had just been distributed as a preprint, whose title was Optimal Perceptual Inference. It was the first paper on Boltzmann machines, by Geoff Hinton and Terry Sejnowski, and it talked about hidden units, about basically the problem of learning multilayer neural nets that are more capable than just linear classifiers. So I said, I need to meet these people, because they are interested in exactly the right problem. A couple of years later, after I started my PhD, I participated in a workshop in Les Houches organized by the people I was working with, and Terry was one of the speakers, so I met him at that time. This is like the early '80s now? This was 1985, early 1985. So I met Terry Sejnowski in early 1985 at that workshop in Les Houches, and a lot of people were there from the early days of the neural net field, along with a lot of people working on theoretical neuroscience and things like that. It was a fascinating workshop. I also met a couple of people from Bell Labs, who eventually hired me, but that was several years before I finished my PhD. So I talked to Terry Sejnowski and told him about what I was working on, which was some version of backprop at the time. This was before backprop was a paper, and Terry was working on NetTalk at the time. The Rumelhart-Hinton-Williams paper on backprop had not yet been published, but Terry was friends with Geoff, so the information was circulating, and he was already working on trying to make this work for NetTalk, but he didn't tell me. He went back to the US and told Geoff there was some kid in France working on the same stuff we were working on. A few months later, in June, there was another conference in France where Geoff was a keynote speaker, and he gave a talk on Boltzmann machines. Of course, he was working on the backprop paper. He gave this talk, and then there were 50 people around him who wanted to talk to him, and the first thing he said to the organizer was, do you know this guy, Yann LeCun? That was because he had read my paper in the proceedings, which was written in French, and he could sort of read French, and he could see the math and figure out that it was sort of backprop. So we had lunch together, and that's how we became friends. I see. So basically multiple groups independently reinvented, or invented, backprop, pretty much. Right, or realized that the whole idea of the chain rule, or what the optimal control people call the adjoint state method, is really the context in which backprop was first invented: optimal control back in the early '60s. This idea that you could do gradient descent through multiple stages is what backprop really is, and it popped up in various contexts at various times. But I think the Rumelhart-Hinton-Williams paper is the one that popularized it.
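To make that chain-rule picture concrete, here is a minimal sketch in present-day Python, added purely for illustration; the tiny two-stage network and the numbers are made up. Each stage reports its local derivative, and the backward pass just multiplies them together before taking a gradient step.

# Backprop as the chain rule applied stage by stage.
# Forward pass: x -> h = w1 * x -> y = tanh(h) -> loss = (y - target)^2
import math

def forward(x, w1, target):
    h = w1 * x
    y = math.tanh(h)
    loss = (y - target) ** 2
    return y, loss

def backward(x, w1, target):
    y, loss = forward(x, w1, target)
    dloss_dy = 2 * (y - target)      # local derivative of the loss stage
    dy_dh = 1 - y ** 2               # local derivative of the tanh stage
    dh_dw1 = x                       # local derivative of the weight stage
    return loss, dloss_dy * dy_dh * dh_dw1   # chain rule: multiply them together

# One step of gradient descent on the single weight w1.
x, w1, target, lr = 0.5, 0.3, 0.8, 0.1
loss, grad = backward(x, w1, target)
w1 -= lr * grad
print(loss, grad, w1)

Repeating that bookkeeping over many stages and many weights is all backprop does.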
I see. Cool. And then, fast forward a few years, you wound up at AT&T Bell Labs, where you invented, among many other things, LeNet, which we talked about in the course. I remember, way back when I was a summer intern at AT&T Bell Labs working with Michael Kearns and a few others, hearing about your work even back then. So tell me more about your AT&T LeNet experience. So what happened is that I actually started working on convolutional nets when I was a postdoc at the University of Toronto with Geoff Hinton. I wrote the code and did the first experiments there, showing what you could do with a very small data set. There was no MNIST or anything like that back then, so I drew a bunch of characters with my mouse on my Amiga personal computer, which was the best computer ever, used data augmentation to increase the set, and then used that as a way to test performance. I compared things like fully connected nets, locally connected nets without shared weights, and then shared-weight networks, which were basically the first conv nets. And that worked really well for relatively small data sets; it showed that you get better performance and no overtraining with the convolutional architecture. When I got to Bell Labs in October 1988, the first thing I did was scale up the network, because we had faster computers. A few months before I arrived, my boss at the time, Larry Jackel, who became my department head at Bell Labs, said, oh, we should order a computer for you before you come; what do you want? I said, well, here at the University of Toronto there's a Sun 4, which is the latest, greatest thing; it would be great if we had one. And they ordered one, and I had one just for myself. At the University of Toronto there had been one for the entire department; now I had one all to myself. What Larry told me was, you know, at Bell Labs you don't get famous by saving money. So that was, like, awesome. They had already been working for a while on character recognition, and they had this enormous data set called USPS that had 5,000 training samples. So immediately I designed and trained a convolutional net, which was basically LeNet 1, trained it on this data set, and got really good results, better than the other methods they had tried and that other people had tried on this data set. So we knew we had something fairly early on; this was within three months of me joining Bell Labs. That was the first version of the convolutional net, where we had convolutions with stride, and we did not have separate subsampling and pooling layers; each convolution was subsampling directly. And the reason is that we simply could not afford to compute a convolution at every location; there was just too much computation. The second version had separate convolution and pooling layers, a separate subsampling; I guess that's the one that's really called LeNet 1. We published a couple of papers on this in Neural Computation and at NIPS.
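As a rough sketch of those two downsampling styles, here is what they might look like in present-day PyTorch; the channel counts and kernel sizes below are illustrative placeholders, not the actual LeNet hyperparameters.

# Hypothetical layer sizes, for illustration only.
import torch
import torch.nn as nn

# Early style: the convolution strides, so it subsamples as it convolves.
strided = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=5, stride=2),   # convolve and subsample in one step
    nn.Tanh(),
    nn.Conv2d(4, 8, kernel_size=5, stride=2),
    nn.Tanh(),
)

# Later style: convolve at every location, then subsample in a separate pooling layer.
conv_then_pool = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=5, stride=1),
    nn.Tanh(),
    nn.AvgPool2d(kernel_size=2, stride=2),      # separate subsampling / pooling layer
    nn.Conv2d(4, 8, kernel_size=5, stride=1),
    nn.Tanh(),
    nn.AvgPool2d(kernel_size=2, stride=2),
)

x = torch.randn(1, 1, 28, 28)                   # a dummy 28x28 grayscale image
print(strided(x).shape, conv_then_pool(x).shape)

Both stacks shrink a 28x28 input to roughly the same spatial size; the first folds the subsampling into the stride of the convolution to save computation, while the second convolves at every location and pools separately.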
And so, one interesting story: I gave a talk at NIPS about this work, and Geoff Hinton was in the audience. I came back to my seat, which was next to his, and he said, you know, there's one bit of information in your talk, which is that if you do all the sensible things, it actually works. And in fact, shortly after, that line of work went on to make history, because these ideas became widely adopted for reading checks. Yeah, so they became widely adopted within AT&T, but not very much outside, and I think it's a little difficult for me to really understand why, but there are several factors. This was back in the late '80s, and there was no internet. We had email, we had FTP, but there was no internet, really. No two labs were using the same software or hardware platform: some people had workstations, others had other machines, some people were using PCs, whatever. There was no such thing as Python or MATLAB or anything like that; people were writing their own code. Léon Bottou, when he was still a student, and I worked together, and we spent a year and a half basically just writing a neural net simulator. At the time, because there was no MATLAB or Python, you had to write your own interpreter to control it, so we wrote our own Lisp interpreter. And so LeNet was written in Lisp, using a numerical backend very similar to what we have now, with blocks that you can interconnect and differentiation, all the stuff we are now familiar with from Torch, PyTorch, TensorFlow, and other frameworks.
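The idea of interconnectable blocks with differentiation can be sketched in a few lines of modern Python; this is only an illustration of the general pattern, with made-up layer sizes, not a reconstruction of the original Lisp system.

# Each block knows its own forward computation and how to pass gradients backward.
import numpy as np

class Linear:
    def __init__(self, n_in, n_out):
        self.W = np.random.randn(n_out, n_in) * 0.1
    def forward(self, x):
        self.x = x
        return self.W @ x
    def backward(self, grad_out):
        self.grad_W = np.outer(grad_out, self.x)   # gradient w.r.t. the weights
        return self.W.T @ grad_out                 # gradient passed to the previous block

class Tanh:
    def forward(self, x):
        self.y = np.tanh(x)
        return self.y
    def backward(self, grad_out):
        return grad_out * (1 - self.y ** 2)

# Interconnect the blocks; the backward pass just walks the chain in reverse.
blocks = [Linear(4, 3), Tanh(), Linear(3, 1)]
out = np.random.randn(4)
for b in blocks:
    out = b.forward(out)
grad = np.ones(1)                                  # pretend dLoss/dOutput = 1
for b in reversed(blocks):
    grad = b.backward(grad)

Once every block carries its own forward and backward rule, wiring blocks into a chain, or a more general graph, gives you the training machinery automatically, which is essentially the pattern that Torch, PyTorch, and TensorFlow package up today.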
So then we developed a bunch of applications. We got together with a group of engineers, very smart people; some of them were theoretical physicists who had turned engineers at Bell Labs. Chris Burges was one of them, and he went on to a distinguished career at Microsoft Research; there was also Craig Nohl and a bunch of other people. We collaborated with them to make this technology practical, and together we developed those character recognition systems. That meant integrating convolutional nets with things similar to what we now call CRFs, for interpreting sequences of characters, not just individual characters. Yeah, right, and the LeNet paper was partially about the neural network and partially about the automata used to put the sequence together. Yeah, that's right. The first half of the paper is on convolutional nets, and the paper is mostly cited for that. The second half very few people have read; it's about sequence-level discriminative learning and basically structured prediction without normalization, so it's very similar to a CRF, in fact, but it predates CRFs by several years. So that was very successful, except for what happened on the day we were celebrating the deployment of the system in a major bank. We had worked with the group I was mentioning, which did the engineering of the whole system, and with a product group in a different part of the country that belonged to a subsidiary of AT&T called NCR. That's the company that makes cash registers; they also build ATM machines and large check-reading machines for banks. So they were the customers, if you want: they were using our check-reading system, and they deployed it in a bank; I can't remember which bank it was. They also deployed it in ATM machines in a French bank, so they could read the checks you would deposit. We were all at a fancy restaurant celebrating the deployment of this thing when the company announced that it was breaking itself up. This was 1995, and AT&T announced that it was splitting into three companies: AT&T, Lucent Technologies, and NCR. So NCR was spun off and Lucent Technologies was spun off; the engineering group went with Lucent Technologies, and the product group, of course, went with NCR. And the sad thing is that the AT&T lawyers, in their infinite wisdom, assigned the patent on convolutional nets, which has thankfully expired (it expired in 2007, about 10 years ago), to NCR, but there was nobody at NCR who actually knew what a convolutional net was. So the patent was in the hands of people who had no idea what they had, while we were in a different company that now could not really develop the technology, and our engineering team was in yet another company, because we stayed with AT&T, the engineering went with Lucent, and the product group went with NCR. So it was a little depressing. So in addition to your early work, when neural networks were hot, you kept persisting with neural networks even when there was a sort of winter for neural nets. What was that like? Well, I persisted and didn't persist, in some ways. I was always convinced that eventually these techniques would come back to the fore, that people would figure out how to use them in practice, and that they would be useful, so I always had that in the back of my mind. But in 1996, when AT&T broke itself up and all of our work on character recognition was basically broken up too, because the product groups went their separate way, I was also promoted to department head, and I had to figure out what to work on. This was the early days of the internet; we're talking 1995. And I had the idea that one big problem around the emergence of the internet was going to be bringing all the knowledge we had on paper into the digital world. So I started a project called DjVu, which was about compressing scanned documents so they could be distributed over the internet. That project was really fun for a while and had some success, although AT&T really didn't know what to do with it. Yeah, I remember that; it helped the dissemination of online research papers. Yeah, that's right, exactly. We scanned the entire proceedings of NIPS and made them available online to demonstrate how it worked, and we could compress high-resolution pages down to just a few kilobytes. So convolutional nets, starting from some of your much earlier work, have now come to pretty much take over the field of computer vision, and they're starting to encroach significantly into other fields as well. Tell me about how you saw that whole process. So I'll tell you how I thought it was going to happen early on. First of all, I always believed that this was going to work. It required fast computers and lots of data, but I always believed, somehow, that it was the right thing to do. What I thought originally, when I was at Bell Labs, was that there would be some sort of continuous progress along these directions as machines got more powerful, and we were even designing chips to run convolutional nets at Bell Labs.
Bernhard Boser, actually, and Hans-Peter Graf, separately, had built two different chips for running convolutional nets really efficiently. So we thought there was going to be a pickup of this, growing interest, and continuous progress. But in fact, because interest in neural nets died in the mid-'90s, that didn't happen. It was kind of a dark period of six or seven years, between roughly 1995 and 2002, when basically nobody was working on this. In fact, there was a little bit of work: there was some work at Microsoft in the early 2000s on using convolutional nets for Chinese character recognition. Patrice Simard. Patrice Simard, yeah, exactly. And there was some other small work on things like face detection, in France and in various other places, but it was very small. I actually discovered recently that a couple of groups came up with ideas essentially very similar to convolutional nets, for medical image analysis, but never quite published them the same way. Those were mostly in the context of commercial systems, and so they never quite made it into the published literature. This was after our first work on convolutional nets, and they were not really aware of it, but it developed in parallel a little bit. So several people got similar ideas, a few years apart. But then I was really surprised by how fast interest picked up after ImageNet in 2012. It was the end of 2012, at a very interesting event at ECCV in Florence, where there was a workshop on ImageNet. Everybody knew that Geoff Hinton's team, with Alex Krizhevsky and Ilya Sutskever, had won by a large margin, and so everybody was waiting for Alex Krizhevsky's talk. Most people in the computer vision community had no idea what a convolutional net was. I mean, they had heard me talk about it; I actually gave an invited talk at CVPR in 2000 where I talked about it, but most people had not paid much attention. Senior people had; they knew what it was. But the more junior people in the community really had no idea what it was. And so Alex Krizhevsky just gives this talk, and he doesn't explain what a convolutional net is, because he assumes everybody knows, right, because he comes from machine learning. So he says, here is how everything is connected, here is how we transform the data, and here are the results we get, assuming that everybody knows what it is. And a lot of people were incredibly surprised, and you could see people's opinions changing as he gave his talk, including very senior people in the field. So do you think that workshop was the defining moment that swayed a lot of the computer vision community? Yeah, definitely. That's where it happened, right there. So today you retain a faculty position at NYU, and you also lead FAIR, Facebook AI Research. I know you have a pretty unique point of view on how corporate research should be done. Do you want to share your thoughts on that? Yeah. One of the beautiful things I managed to do at Facebook over the last four years is that I was given a lot of freedom to set up FAIR the way I thought was most appropriate, because this was the first research organization within Facebook. Facebook is an engineering-centric company, and until then it had really been focused on survival, on short-term things.
Facebook was about to turn 10 years old, had had a successful IPO, and was basically thinking about the next 10 years. Mark Zuckerberg was asking, what is going to be important for the next 10 years? The survival of the company was not in question anymore. So this was the kind of transition where a large company can start to think further ahead (it was not such a large company at the time; Facebook had 5,000 employees or so), but they had the luxury to think about the next 10 years and what would be important in technology. And Mark and his team decided that AI was going to be a crucial piece of technology for connecting people, which is the mission of Facebook. So they explored several ways to build an effort in AI. They had a small internal engineering group experimenting with convolutional nets that was getting really good results on face recognition and various other things, which piqued their interest. They explored the idea of hiring a bunch of young researchers, or acquiring a company, or things like that, and they settled on the idea of hiring someone senior in the field and setting up a research organization. It was a bit of a culture shock initially, because the way research operates in a company is very different from engineering: you have longer timescales and horizons, and researchers tend to be very conservative about the choice of places where they want to work. I made it very clear very early on that research needed to be open, that researchers needed not only to be encouraged to publish but even to be mandated to publish, and also to be evaluated on criteria similar to those we use to evaluate academic researchers. What Mark and Mike Schroepfer, the CTO of the company, who is my boss now, said was: Facebook is a very open company; we distribute a lot of stuff in open source. Schrep, the CTO, comes from the open source world, he was at Mozilla before that, and a lot of people came from that world, so it was in the DNA of the company. That made me very confident that we could set up an open research organization. And then the fact that the company is not obsessive-compulsive about IP, as some other companies are, makes it much easier to collaborate with universities and to have arrangements by which a person can have a foot in industry and a foot in academia. And you find that valuable yourself. Oh, absolutely, yes. If you look at my publications over the last four years, the vast majority of them are publications with my students at NYU, because at Facebook I did a lot of organizing the lab, hiring, scientific direction, and advising, but I don't get involved in individual research projects to get my name on papers. And I don't care to get my name on papers anymore. It's about setting someone else up to do great work rather than doing all the great work yourself. And you want to stay behind the scenes; you don't want to put yourself in competition with the people in your lab. I'm sure you get asked this a lot, but I hope you can answer for all the people watching this video as well: what advice do you have for someone wanting to get involved in AI, to break into AI?
I mean, it's such a different world now than when I got started. I think what's great now is that it's very easy for people to get involved at some level. The tools that are available are so easy to use now, with TensorFlow, PyTorch, whatever. You can have a relatively cheap computer in your bedroom and basically train a convolutional net or a recurrent net to do whatever you want. There are a lot of tools, and you can learn a lot from online material without it being very onerous. So you see high school students playing with this now, which is great, I think. There's certainly a growing interest from the student population in learning about machine learning and AI, and it's very exciting for young people; I find that wonderful. So my advice is, if you want to get into this, make yourself useful. Make a contribution to an open source project, for example, or make an implementation of some standard algorithm that you couldn't find the code for online but that you'd like to make available to other people. Take a paper that you think is important, reimplement the algorithm, and put it up as an open source package, or contribute to one of the existing open source packages. If the stuff you write is interesting and useful, you will get noticed. Maybe you'll get a nice job at a company you really want to work for, or maybe you'll get accepted into your favorite PhD program, or things like that. I think that's a good way to get started. So open source contributions are a good way to enter the community. Yeah, that's right; you give back and learn. That's right. Thanks a lot, Yann. That was fascinating. Even having known you for many years, it's still fascinating to hear the details behind all these stories that have unfolded over the years. Yeah, there are many, many stories like this; at the moment they happen, you don't realize what importance they might take on 10 or 20 years later. Yeah, thank you. Thanks.