As part of this course by DeepLearning.ai, I hope to not just teach you the technical ideas in deep learning, but also introduce you to some of the people, some of the heroes in deep learning, the people who invented so many of these ideas that you learn about in this course or in this specialization. In these videos, I hope to also ask these leaders of deep learning to give you career advice for how you can break into deep learning, for how you can do research or find a job in deep learning. As the first of this interview series, I'm delighted to present to you an interview with Geoffrey Hinton. Welcome, Geoff, and thank you for doing this interview with DeepLearning.ai. Thank you for inviting me. I think that at this point, you more than anyone else on this planet have invented so many of the ideas behind deep learning, and a lot of people have been calling you the godfather of deep learning, although it wasn't until we were just chatting a few minutes ago that I realized you think I'm the first one to call you that, which I'm quite happy to have done. But what I want to ask is, many people know you as a legend. I want to ask about your personal story behind the legend. So, going way back, how did you get involved in AI and machine learning and neural networks? So when I was at high school, I had a classmate who was always better than me at everything. He was a brilliant mathematician. And he came into school one day and said, did you know the brain uses holograms? And I guess that was about 1966. And I said, sort of, what's a hologram? And he explained that in a hologram, you can chop off half of it, and you still get the whole picture. And the memories in the brain might be distributed over the whole brain. And so I guess he'd read about Lashley's experiments, where you chop out bits of a rat's brain and discover it's very hard to find one bit where it stores one particular memory. So that's what first got me interested in, how does the brain store memories? And then when I went to university, I started off studying physiology and physics. I think when I was at Cambridge, I was the only undergraduate doing physiology and physics. And then I gave up on that and tried to do philosophy, because I thought that might give me more insight. But that seemed to me actually lacking in ways of distinguishing when they said something false. And so then I switched to psychology. And in psychology, they had very, very simple theories. And it seemed to me it was, sort of, hopelessly inadequate for explaining what the brain was doing. So then I took some time off and became a carpenter. And then I decided I'd try AI. And I went off to Edinburgh to study AI with Longuet-Higgins. And he had done very nice work on neural networks. And he'd just given up on neural networks, and been very impressed by Winograd's thesis. So when I arrived, he thought I was kind of doing this old-fashioned stuff, and I ought to start on symbolic AI. And we had a lot of fights about that. But I just kept on doing what I believed in. And then what? I eventually got a PhD in AI. And then I couldn't get a job in Britain. But I saw this very nice advertisement for Sloan Fellowships in California. And I managed to get one of those. And I went to California, and everything was different there. So in Britain, neural nets were regarded as kind of silly. And in California, Don Norman and David Rumelhart were very open to ideas about neural nets.
It was the first time I'd been somewhere where thinking about how the brain works, and thinking about how that might relate to psychology, was seen as a very positive thing. And it was a lot of fun there. In particular, collaborating with David Rumelhart was great. I see. Right. So this is when you were at UCSD, and you and Rumelhart, around what, 1982, wound up writing the seminal backprop paper. Actually, it was more complicated than that. What happened? In, I think, early 1982, David Rumelhart and me, and Ron Williams, between us developed the backprop algorithm. It was mainly David Rumelhart's idea. We discovered later that many other people had invented it. David Parker had invented it, probably after us, but before we published. Paul Werbos had published it already quite a few years earlier, but nobody paid much attention. And there were other people who developed very similar algorithms. It's not clear what's meant by backprop. But using the chain rule to get derivatives was not a novel idea. Why do you think it was your paper that helped the community latch on to backprop so much? It feels like your paper marked an inflection point in the community's acceptance of this algorithm. So we managed to get a paper into Nature in 1986. And I did quite a lot of political work to get the paper accepted. I figured out that one of the referees was probably going to be Stuart Sutherland, who was a well-known psychologist in Britain. And I went to talk to him for a long time and explained to him exactly what was going on. And he was very impressed by the fact that we showed that backprop could learn representations for words. And you could look at those representations, which were little vectors, and you could understand the meaning of the individual features. So we actually trained it on little triples of words about family trees, like Mary has mother Victoria. And you'd give it the first two words, and it would have to predict the last word. And after you trained it, you could see all sorts of features in the representations of the individual words, like the nationality of the person and what generation they were, which branch of the family tree they were in, and so on. That was what made Stuart Sutherland really impressed with it. And I think that was why the paper got accepted. So these were very early word embeddings, and you were already seeing learned features with semantic meanings emerge from the training algorithm. Yes. So from a psychologist's point of view, what was interesting was it unified two completely different strands of ideas about what knowledge was like. So there was the old psychologist view that a concept is just a big bundle of features. And there's lots of evidence for that. And then there was the AI view of the time, which is a far more structuralist view, which was that a concept is how it relates to other concepts. And to capture a concept, you'd have to do something like a graph structure, or maybe a semantic net. And what this backpropagation example showed was you could give it the information that would go into a graph structure, or in this case, a family tree. And it could convert that information into features in such a way that it could then use the features to derive new consistent information, i.e. generalize. But the crucial thing was this to and fro between the graphical representation, or the tree-structured representation of the family tree, and a representation of the people as big feature vectors.
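To make the family-tree example concrete, here is a minimal sketch, in NumPy, of the kind of setup described here: triples of the form (person, relation, person), trained to predict the third element, with small feature vectors for people and relations that you can inspect afterwards. The tiny family tree, the vector sizes, and the training details below are invented for illustration and are not the original 1986 setup.

```python
# Minimal sketch: learn feature vectors for people and relations from
# (person, relation, person) triples by predicting the third word.
# The data and all hyperparameters are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

people = ["Victoria", "James", "Mary", "Arthur"]
relations = ["has_mother", "has_father", "has_brother"]
triples = [("Mary", "has_mother", "Victoria"),
           ("Mary", "has_father", "James"),
           ("Arthur", "has_mother", "Victoria"),
           ("Arthur", "has_father", "James"),
           ("Mary", "has_brother", "Arthur")]

p_idx = {p: i for i, p in enumerate(people)}
r_idx = {r: i for i, r in enumerate(relations)}

dim = 6                                                 # size of each feature vector
E = 0.1 * rng.standard_normal((len(people), dim))       # person embeddings
R = 0.1 * rng.standard_normal((len(relations), dim))    # relation embeddings
W = 0.1 * rng.standard_normal((dim, len(people)))       # output weights
lr = 0.5

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for epoch in range(500):
    for p1, rel, p2 in triples:
        h = E[p_idx[p1]] + R[r_idx[rel]]    # combine the two input vectors
        probs = softmax(h @ W)              # distribution over possible third words
        target = np.zeros(len(people))
        target[p_idx[p2]] = 1.0
        dlogits = probs - target            # cross-entropy gradient
        dh = W @ dlogits
        W -= lr * np.outer(h, dlogits)
        E[p_idx[p1]] -= lr * dh
        R[r_idx[rel]] -= lr * dh

# After training, the rows of E are small feature vectors whose individual
# components can pick up properties such as generation or branch of the tree.
print(np.round(E, 2))
```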
And the fact that from the graph-like representation, you could get to the feature vectors. And from the feature vectors, you could get more of the graph-like representation. So this is 1986. In the early 90s, Bengio showed that you could actually take real data, you could take English text, and apply the same techniques there, and get embeddings for real words from English text. And that impressed people a lot. I guess recently, we've been talking a lot about how faster computers, like GPUs and supercomputers, are driving deep learning. I didn't realize that back between 1986 and the early 90s, it sounds like between you and Bengio, there were already the beginnings of this trend. Yes, there was a huge advance. I mean, in 1986, I was using a Lisp machine, which was less than a tenth of a megaflop. And by about 1993, or thereabouts, people were seeing like 10 megaflops. So it was a factor of 100. And that's the point at which it was easy to use, because computers were just getting faster. Over the past several decades, you've invented so many pieces of neural networks and deep learning. I'm actually curious, of all of the things you've invented, which are the ones you're still most excited about today? So I think the most beautiful one is the work I did with Terry Sejnowski on Boltzmann machines. So we discovered there was this really, really simple learning algorithm that applied to great big, densely connected nets where you could only see a few of the nodes. So it would learn hidden representations. And it was a very simple algorithm. And it looked like the kind of thing you should be able to get in a brain because each synapse only needed to know about the behavior of the two neurons it was directly connected to. And the information that was propagated was the same. There were two different phases, which we called wake and sleep. But in the two different phases, you're propagating information in just the same way. Whereas in something like backpropagation, there's a forward pass and a backward pass. And they work differently. They're sending different kinds of signals. Right. So I think that's the most beautiful thing. And for many years, it looked just like a curiosity because it looked like it was much too slow. But then later on, I got rid of a little bit of the beauty. And instead of letting things settle down, just used one iteration in a somewhat simpler net. And that gave restricted Boltzmann machines, which actually worked effectively in practice. So in the Netflix competition, for example, restricted Boltzmann machines were one of the ingredients of the winning entry. In fact, a lot of the recent resurgence of neural nets and deep learning starting about, I guess, 2007 was the restricted Boltzmann machine and deep belief net work that you and your lab did. Yes. So that's another of the pieces of work I'm very happy with. The idea was that you could train a restricted Boltzmann machine, which just had one layer of hidden features. And you could learn one layer of features. And then you could treat those features as data and do it again. And then you could treat the new features you'd learned as data and do it again as many times as you liked. So that was nice. It worked in practice. And then Yee Whye Teh realized that the whole thing could be treated as a single model, but it was a weird kind of model.
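The "one iteration instead of letting things settle" idea mentioned above is, roughly, what became contrastive divergence training of a restricted Boltzmann machine. Here is a minimal sketch with made-up binary data; the sizes, learning rate, and data are illustrative assumptions, not taken from any particular paper.

```python
# Minimal sketch of a binary restricted Boltzmann machine trained with a
# single Gibbs step (CD-1): one reconstruction instead of letting the
# network settle to equilibrium. Data and sizes are invented.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden, lr = 6, 3, 0.1
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_v = np.zeros(n_visible)   # visible biases
b_h = np.zeros(n_hidden)    # hidden biases

# Toy binary data: two repeating patterns.
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]] * 50, dtype=float)

for epoch in range(200):
    for v0 in data:
        # Positive phase: hidden probabilities given the data.
        p_h0 = sigmoid(v0 @ W + b_h)
        h0 = (rng.random(n_hidden) < p_h0).astype(float)
        # One reconstruction step (the "single iteration").
        p_v1 = sigmoid(h0 @ W.T + b_v)
        p_h1 = sigmoid(p_v1 @ W + b_h)
        # Contrastive divergence update: data statistics minus reconstruction statistics.
        W += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
        b_v += lr * (v0 - p_v1)
        b_h += lr * (p_h0 - p_h1)

# The learned hidden probabilities can then be treated as data for the
# next restricted Boltzmann machine in a stack.
print(np.round(sigmoid(data[:2] @ W + b_h), 2))
```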
It was a model where at the top you had a restricted Boltzmann machine, but below that you had a sigmoid belief net, which was something that Radford Neal had invented many years earlier. So it was a directed model. And what we'd managed to come up with by training these restricted Boltzmann machines was an efficient way of doing inference in sigmoid belief nets. So around that time, there were people doing neural nets who would use densely connected nets, but didn't have any good ways of doing probabilistic inference in them. And you had people doing graphical models, like Mike Jordan, who could do inference properly, but only in sparsely connected nets. And what we managed to show was there's a way of learning these deep belief nets, so that there's an approximate form of inference that's very fast. It just happens in a single forward pass. And that was a very beautiful result. And you could guarantee that each time you learned an extra layer of features, there was a bound. Each time you learned a new layer, you got a new bound, and the new bound was always better than the old bound. Yeah, the variational bound showing that the bound improves as you add layers. Yes, yeah. So that was the second thing that I was really excited by. And I guess the third thing was the work I did with Radford Neal on variational methods. It turns out people in statistics had done similar work earlier, but we didn't know about that. So we managed to make EM work a whole lot better by showing you didn't need to do a perfect E-step. You could do an approximate E-step. And EM was a big algorithm in statistics, and we'd shown a big generalization of it. And in particular, in 1993, I guess, with Van Camp, I did a paper that was, I think, the first variational Bayes paper, where we showed that you could actually do a version of Bayesian learning that was far more tractable by approximating the true posterior with a Gaussian. And you could do that in a neural net. And I was very excited by that. I see. Wow. Right. Yep. I think I remember all of these papers, the Neal and Hinton approximate EM paper. Right. Spent many hours reading over that. And I think, you know, some of the algorithms you use today, or some of the algorithms that lots of people use almost every day are things like dropout or, I guess, ReLU activations, which came from your group? Yes and no. So other people had thought about rectified linear units. And we actually did some work with restricted Boltzmann machines showing that a ReLU was almost exactly equivalent to a whole stack of logistic units. And that's one of the things that helped ReLUs catch on. I was really curious about that. The ReLU paper had a lot of math showing that this simple function can be approximated by this really complicated formula. Did you do that math so your paper would get accepted into an academic conference? Or did all that math really influence the development of max of zero and x? That was one of the cases where actually the math was important to the development of the idea. So I knew about rectified linear units, obviously, and I knew about logistic units. And because of the work on Boltzmann machines, all of the basic work was done using logistic units. And so the question was, could the learning algorithm work in something with rectified linear units? And by showing that rectified linear units were almost exactly equivalent to a stack of logistic units, we showed that all the math would go through. I see.
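As a rough numerical illustration of the equivalence being described, assuming the usual presentation of the result: a stack of logistic units that share weights but have their biases offset by 0.5, 1.5, 2.5, and so on sums to approximately softplus(x) = log(1 + e^x), which in turn is close to max(0, x). This is a sketch only; see the original rectified-linear-unit paper for the actual derivation.

```python
# Numerically compare a sum of shifted logistic units, softplus, and ReLU.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-5, 10, 7)
stack_of_logistics = sum(sigmoid(x - i + 0.5) for i in range(1, 200))
softplus = np.log1p(np.exp(x))
relu = np.maximum(0.0, x)

for a, b, c, d in zip(x, stack_of_logistics, softplus, relu):
    print(f"x={a:6.2f}  sum-of-logistics={b:7.3f}  softplus={c:7.3f}  relu={d:7.3f}")
```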
And it provided the inspiration, but today tons of people use ReLUs and it just works without necessarily needing to understand the same motivation. Yeah. One thing I noticed later, when I went to Google, I guess in 2014: I gave a talk at Google about using ReLUs and initializing with the identity matrix. Because the nice thing about ReLUs is if you keep replicating the hidden layers and you initialize with the identity, it just copies the pattern in the layer below. And so I was showing that you could train networks with 300 hidden layers, and you could train them really efficiently if you initialize with the identity. But I didn't pursue that any further, and I really regret not pursuing that. We published one paper with Quoc Le showing you could initialize recurrent nets like that. But I should have pursued it further because later on these residual networks were really that kind of thing. Over the years, I've heard you talk a lot about the brain. I've heard you talk about the relationship between backprop and the brain. What are your current thoughts on that? I'm actually working on a paper on that right now. I guess my main thought is this. If it turns out that backprop is a really good algorithm for doing learning, then for sure evolution could have figured out how to implement it. I mean you have cells that can turn into either eyeballs or teeth. Now if cells can do that, they can for sure implement backpropagation. And presumably there's huge selective pressure for it. So I think the neuroscientist's idea that it doesn't look plausible is just silly. There may be some subtle implementation of it. And I think the brain probably has something that may not be exactly backpropagation, but is quite close to it. And over the years, I've come up with a number of ideas about how this might work. So in 1987, working with Jay McClelland, I came up with the recirculation algorithm, where the idea is you send information around a loop and you try to make it so that things don't change as information goes around this loop. So the simplest version would be you have input units and hidden units, and you send information from the input to the hidden and then back to the input, and then back to the hidden, and then back to the input and so on. And what you want, you want to train an autoencoder, but you want to train it without having to do backpropagation. So you just train it to try and get rid of all variation in the activities. So the idea is that the learning rule for a synapse is to change the weight in proportion to the presynaptic input, and in proportion to the rate of change of the postsynaptic input. But in recirculation, you're trying to make the old postsynaptic input be good and the new one be bad, so you're changing the weight in that direction. And we invented this algorithm before neuroscientists had come up with spike-timing-dependent plasticity. Spike-timing-dependent plasticity is actually the same algorithm, but the other way around, where the new thing is good and the old thing is bad in the learning rule. So you're changing the weight in proportion to the presynaptic activity times the new postsynaptic activity minus the old one. Later on, I realized in 2007 that if you took a stack of restricted Boltzmann machines and you trained it up, after it was trained, you then had exactly the right conditions for implementing backpropagation by just trying to reconstruct.
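A toy sketch of the flavour of that recirculation rule, assuming the usual input-to-hidden-to-input-to-hidden loop: every weight is changed in proportion to its presynaptic activity times the difference between the earlier and later postsynaptic activities, with no derivatives propagated backwards. The sizes, data, learning rate, and the partial-reconstruction constant are illustrative assumptions, and the details differ from the published algorithm.

```python
# Toy sketch of a recirculation-style autoencoder: purely local updates,
# presynaptic activity times (old postsynaptic - new postsynaptic).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_vis, n_hid, lr, lam = 8, 4, 0.1, 0.75
W_vh = 0.1 * rng.standard_normal((n_vis, n_hid))   # input -> hidden weights
W_hv = 0.1 * rng.standard_normal((n_hid, n_vis))   # hidden -> input weights

data = rng.random((20, n_vis))                      # made-up input vectors

for epoch in range(100):
    for v0 in data:
        h0 = sigmoid(v0 @ W_vh)                          # first pass to the hidden layer
        v1 = lam * v0 + (1 - lam) * sigmoid(h0 @ W_hv)   # partial reconstruction of the input
        h1 = sigmoid(v1 @ W_vh)                          # second pass to the hidden layer
        # Local updates: presynaptic activity times (old - new) postsynaptic activity.
        W_hv += lr * np.outer(h0, v0 - v1)
        W_vh += lr * np.outer(v1, h0 - h1)

recon = sigmoid(sigmoid(data @ W_vh) @ W_hv)
print("mean reconstruction error:", np.mean((data - recon) ** 2))
```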
If you looked at the reconstruction error, that reconstruction error would actually tell you the derivative of the discriminative performance. And at the first deep learning workshop at NIPS in 2007, I gave a talk about that, which was almost completely ignored. Later on, Yoshua Bengio took up the idea and has actually done quite a lot more work on that. And I've been doing more work on it myself. And I think this idea that if you have a stack of autoencoders, then you can get derivatives by sending activity backwards and looking at reconstruction errors is a really interesting idea and may well be how the brain does it. One other topic that I know you've thought a lot about and that I hear you're still working on is how to deal with multiple timescales in deep learning. So can you share your thoughts on that? Yeah, so actually, that goes back to my first year as a graduate student. The first talk I ever gave was about using what I called fast weights. So weights that adapt rapidly, but decay rapidly, and therefore can hold short-term memory. And I showed in a very simple system in 1973 that you could do true recursion with those weights. And what I mean by true recursion is that the neurons that are used for representing things get reused for representing things in the recursive call. And the weights that are used for representing knowledge get reused in the recursive call. And so that leaves the question of when you pop out of a recursive call, how do you remember what it was you were in the middle of doing? Where's that memory? Because you've used the neurons for the recursive call. And the answer is you can put that memory into fast weights, and you can recover the activity states of the neurons from those fast weights. And more recently, working with Jimmy Ba, we actually got a paper in NIPS about using fast weights for recursion like that. So that was quite a big gap. The first model was unpublished in 1973. And then Jimmy Ba's model was in 2015, I think, or 2016. So it's about 40 years later. And I guess one other idea I've heard you talk about for quite a few years now, over five years, I think, is capsules. Where are you with that? Okay, so I'm back to the state I'm used to being in, which is I have this idea I really believe in, and nobody else believes it. And I submit papers about it, and they all get rejected. But I really believe in this idea, and I'm just going to keep pushing it. So it hinges on... There are a couple of key ideas. One is about how you represent multidimensional entities. And you can represent multidimensional entities by just a little vector of activities, as long as you know there's only one of them. So the idea is in each region of the image, you'll assume there's at most one of a particular kind of feature. And then you'll use a bunch of neurons, and their activities will represent the different aspects of that feature. Like within that region, exactly what are its x and y coordinates? What orientation is it at? How fast is it moving? What colour is it? How bright is it? And stuff like that. So you can use a whole bunch of neurons to represent different dimensions of the same thing, provided there's only one of them. That's a very different way of doing representation from what we're normally used to in neural nets. Normally in neural nets, we just have a great big layer, and all the units go off and do whatever they do. But you don't think of bundling them up into little groups that represent different coordinates of the same thing.
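A minimal sketch of that representational idea: instead of one big layer of independent scalar units, the activities are grouped into small "capsules", each describing at most one instance of a particular kind of feature in its region of the image. The particular properties and sizes here are invented for illustration.

```python
# Minimal sketch of grouping activities into capsules, each of which reads
# as the pose of at most one feature instance in its region, rather than as
# a bag of unrelated scalar units. Names and sizes are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

CAPSULE_DIMS = ["presence", "x", "y", "orientation", "scale", "brightness"]
n_regions, n_feature_types = 4, 3
k = len(CAPSULE_DIMS)

# A conventional layer would just be n_regions * n_feature_types * k scalars;
# here the same activities are organized into little groups of size k.
layer = rng.random((n_regions, n_feature_types, k))

def read_capsule(layer, region, feature_type):
    """Interpret one group of activities as the properties of a single feature instance."""
    return dict(zip(CAPSULE_DIMS, layer[region, feature_type]))

# Each capsule answers: is my feature present in my region, and if so,
# exactly what are its pose parameters?
print(read_capsule(layer, region=2, feature_type=0))
```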
So I think there should be this extra structure. And then the other idea that goes with that... So this means in the distributed representation, you partition the representation to have different subsets to represent... Right. I call each of those subsets a capsule. And the idea is a capsule is able to represent an instance of a feature, but only one. And it represents all the different properties of that feature. So it's a feature that has lots of properties, as opposed to a normal neuron in a normal neural net, which just has one scalar property. Sure. I see. Yep. Right. And then what you can do, if you've got that, is you can do something that normal neural nets are very bad at, which is you can do what I call routing by agreement. So let's suppose you want to do segmentation. And you have something that might be a mouth and something else that might be a nose. And you want to know if you should put them together to make one thing. So the idea is you have a capsule for a mouth that has the parameters of the mouth, and you have a capsule for a nose that has the parameters of the nose. And then to decide whether to put them together or not, you get each of them to vote for what the parameters should be for a face. I see. Now if the mouth and the nose are in the right spatial relationship, they will agree. So when you get two capsules at one level voting for the same set of parameters at the next level up, you can assume they're probably right, because agreement in a high-dimensional space is very unlikely. I see. And that's a very different way of doing filtering than what we normally use in neural nets. So I think this routing by agreement is going to be crucial for getting neural nets to generalize much better from limited data. I think it'd be very good at dealing with changes in viewpoint, very good at doing segmentation. And I'm hoping it'll be much more statistically efficient than what we currently do in neural nets, which is if you want to deal with changes in viewpoint, you just give it a whole bunch of changes in viewpoint and train it on them all. I see. Right, right. So rather than just doing supervised learning the way people do now, you could learn this in some different way. Well, I still plan to do it with supervised learning, but the mechanics of the forward pass are very different. It's not a pure forward pass in the sense that there are little bits of iteration going on where you think you found a mouth and you think you found a nose and you do a little bit of iteration to decide whether they should really go together to make a face. I see. And you can do backprop through all that iteration. So you can train it all discriminatively. And we're working on that now at my group in Toronto. So I now have a little Google team in Toronto, part of the Brain team. I see. That's what I'm excited about right now. I see. Great. Yeah. I look forward to that paper when it comes out. Yeah. If it comes out. You know, you've worked in deep learning for several decades. I'm actually really curious, how has your thinking, your understanding of AI, you know, changed over these years? So I guess a lot of my intellectual history has been around backpropagation and how to use backpropagation, how to make use of its power. So to begin with, in the mid-80s, we were using it for discriminative learning and it was working well. I then decided by the early 90s that actually most human learning was going to be unsupervised learning. And I got much more interested in unsupervised learning.
And that's when I worked on things like the wake-sleep algorithm. And your comments at that time really influenced my thinking as well. So when I was leading Google Brain, on our first project I spent a lot of effort on unsupervised learning because of your influence. Right. And I may have misled you. That is, in the long run, I think unsupervised learning is going to be absolutely crucial. Yeah. But you have to sort of face reality. And what's worked over the last 10 years or so is supervised learning, discriminative training, where you have labels, or you're trying to predict the next thing in a sequence, so that acts as the label. And that's worked incredibly well. And I still believe that unsupervised learning is going to be crucial and things will work far better than they do now when we get that working properly. But we haven't yet. Yeah. I think many of the senior people in deep learning, including myself, remain very excited about it. It's just none of us really have almost any idea how to do it yet. Maybe you do. I don't feel like I do. Well, variational autoencoders, where you use the re-parameterization trick, seem to be a really nice idea. And generative adversarial nets also seem to be a really nice idea. I think generative adversarial nets are one of the sort of biggest ideas in deep learning that's really new. I'm hoping I can make capsules that successful. But right now, generative adversarial nets, I think, have been a big breakthrough. What happened to sparsity and slow features, which were two of the other principles for building unsupervised models? I was never as big on sparsity as you were. But slow features, I think, is a mistake. You shouldn't say slow. The basic idea is right. But you shouldn't go for features that don't change. You should go for features that change in predictable ways. So here's the sort of basic principle about how you model anything. You take your measurements and you apply non-linear transformations to your measurements until you get to a representation as a state vector in which the action is linear. So you don't just pretend it's linear like you do with Kalman filters, but you actually find a transformation from the observables to the underlying variables where linear operations like matrix multiplies on the underlying variables will do the work. So for example, if you want to change viewpoints, if you want to produce an image from another viewpoint, what you should do is go from the pixels to coordinates. And once you've got to the coordinate representation, which is the kind of thing I'm hoping capsules will find, you can then do a matrix multiply to change viewpoint. And then you can map it back to pixels. Right. It's a very, very general principle. That's why you did all that work on face synthesis, right? You take a face and compress it to a very low-dimensional vector, and then you can vary that and get back other faces. I had a student who worked on that. I didn't do much work on that myself. I'm sure you still get asked all the time, if someone wants to break into deep learning, what should they do? So what advice would you have? I'm sure you've given a lot of advice to people in one-on-one settings, but for the global audience of people watching this video, what advice would you have for them to get into deep learning? OK, so my advice is sort of read the literature, but don't read too much of it. So this is advice I got from my advisor, which is very unlike what most people say.
Most people say you should spend several years reading the literature and then you should start working on your own ideas. And that may be true for some researchers, but for creative researchers, I think what you want to do is read a little bit of the literature and notice something that you think everybody is doing wrong, and be contrarian in that sense. You look at it and it just doesn't feel right, and then you figure out how to do it right. And then when people tell you that's no good, just keep at it. And I have a very good principle for helping people keep at it, which is either your intuitions are good or they're not. If your intuitions are good, you should follow them and you'll eventually be successful. If your intuitions are not good, it doesn't matter what you do. Right, inspiring advice, so you might as well go for it. You might as well trust your intuitions. There's no point not trusting them. I see, yeah. I usually advise people to not just read, but replicate published papers, and maybe that puts a natural limiter on how many you could do, because replicating results is pretty time-consuming. Yes, it's true that when you try and replicate a published paper, you discover all the little tricks necessary to make it work. The other advice I have is never stop programming, because if you give a student something to do, if they're a bad student, they'll come back and say it didn't work. And the reason it didn't work would be some little decision they made that they didn't realize was crucial. And if you give it to a good student, like Yee Whye Teh, for example, you can give him anything, and he'll come back and he'll say it worked. I remember doing this once. And I said, but wait a minute, Yee Whye, since we last talked, I realized it couldn't possibly work for the following reason. And Yee Whye said, oh yeah, well, I realized that right away, so I assumed you didn't mean that. Yeah, that's good, yeah. Let's see, any other advice for people that want to break into AI and deep learning? I think that's basically, read enough so you start developing intuitions, and then trust your intuitions. I see, cool. And go for it. And don't be too worried if everybody else says it's nonsense. And I guess there's no way to know if others are right or wrong when they say it's nonsense, but you just have to go for it and then find out. Right, but there is one way, there's one thing, which is if you think it's a really good idea, and other people tell you it's complete nonsense, then you know you're really onto something. So one example of that is when Radford and I first came up with variational methods, I sent mail explaining it to a former student of mine called Peter Brown, who knew a lot about EM, and he showed it to people who worked with him, called the Della Pietra brothers; they were twins, I think. Yes, yes. And he then told me later what they said, and they said to him, either this guy's drunk or he's just stupid. So they really, really thought it was nonsense. Now it could have been partly the way I explained it, because I explained it in intuitive terms. But when you have what you think is a good idea and other people think it's complete rubbish, that's the sign of a really good idea. Oh, I see. Unless you're wrong. Oh, and research topics, you know, new grad students should work on, what, capsules and maybe unsupervised learning, any others? One good piece of advice for new grad students is, see if you can find an advisor who has beliefs similar to yours.
Because if you work on stuff that your advisor feels deeply about, you'll get a lot of good advice and time from your advisor. If you work on stuff your advisor's not interested in, you'll get some advice, but it won't be nearly so useful. I see. And last one on advice for learners. How do you feel about people entering a PhD program versus joining, you know, a top company or top research group in a corporation? Yeah, it's complicated. I think right now what's happening is, there aren't enough academics trained in deep learning to educate all the people we need educated in universities. There just isn't the faculty bandwidth there. But I think that's going to be temporary. I think what's happened is most departments are being very slow to understand the kind of revolution that's going on. I kind of agree with you that it's not quite a second industrial revolution, but it's something on nearly that scale. And there's a huge sea change going on, basically because our relationship to computers has changed. Instead of programming them, we now show them, and they figure it out. That's a completely different way of using computers. And computer science departments are built around the idea of programming computers. And they don't understand that this business of showing computers is going to be as big as programming computers. And so they don't understand that half the people in the department should be people who get computers to do things by showing them. I see. Right. So my own department refuses to acknowledge that it should have lots and lots of people doing this. It thinks it's got a couple, and maybe a few more, but not too many. And in that situation, you have to rely on the big companies to do quite a lot of the training. So Google is now training people we call Brain Residents. I suspect the universities will eventually catch up. I see. Right. In fact, maybe a lot of students have figured this out. In a lot of top PhD programs, over half the PhD applicants actually want to work on showing rather than programming. Yes. Yeah. Cool. Yeah. Yeah. In fact, to give credit where it's due, while deeplearning.ai is creating a deep learning specialization, as far as I know the first deep learning MOOC was actually yours, taught on Coursera back in 2012. And somewhat strangely, that's where you first introduced the RMSProp algorithm, which also took off. Right. Yes. Well, as you know, that was because you invited me to do the MOOC. And then when I was very dubious about doing it, you kept pushing me to do it. So it was very good that I did, although it was a lot of work. Yes. Yes. And thank you for doing that. I remember you complaining to me how much work it was, and you were staying up late at night. But I think, you know, many, many learners have benefited from your first MOOC, and I'm still very grateful to you for it. So that's good. Yeah. Yeah. Over the years, I've seen you embroiled in debates about paradigms for AI, and whether there's been a paradigm shift for AI. Can you share your thoughts on that? Yes, happily. So I think in the early days, back in the 50s, people like von Neumann and Turing didn't believe in symbolic AI. They were far more inspired by the brain. Unfortunately, they both died much too young. And their voices weren't heard.
And in the early days of AI, people were completely convinced that the representations you needed for intelligence were symbolic expressions of some kind, sort of cleaned-up logic, where you could do non-monotonic things, and not quite logic, but something like logic, and that the essence of intelligence was reasoning. What's happened now is there's a completely different view, which is that what a thought is, is just a great big vector of neural activity. So contrast that with the thought being a symbolic expression. And I think the people who thought that thoughts were symbolic expressions just made a huge mistake. What comes in is a string of words, and what comes out is a string of words. And because of that, strings of words are the obvious way to represent things. So they thought what must be in between was a string of words, or something like a string of words. And I think what's in between is nothing like a string of words. I think the idea that thoughts must be in some kind of language is as silly as the idea that understanding the layout of a spatial scene must be in pixels. Pixels come in. And if we had a dot matrix printer attached to us, then pixels would come out. But what's in between isn't pixels. And so I think thoughts are just these great big vectors. And the big vectors have causal powers. They cause other big vectors. And that's utterly unlike the standard AI view that thoughts are symbolic expressions. I see. Yep. I guess AI is certainly coming around to this new point of view these days. Some of it. I think a lot of people in AI still think thoughts have to be symbolic expressions. Thank you very much for doing this interview. It's fascinating to hear how deep learning has evolved over the years, as well as how you're still helping drive it into the future. So thank you, Geoff. Well, thank you for giving me this opportunity. Okay, thank you.