Thanks for sticking with me for this final optional video on image generation. So far this week, we've focused most of our attention on text generation. Text generation is what a lot of users are using, and it's having the biggest impact of all the different tools of generative AI. But part of the excitement of generative AI is also image generation. There are also starting to be some models that can generate either text or images, and these are sometimes called multimodal models, because they can operate in multiple modalities: text or images. What I'd like to do in this video is share with you how image generation works. Let's take a look. With just a prompt, you can use generative AI to generate a beautiful picture of a person that has never existed, or a picture of a futuristic scene, or a picture of a cool robot like this. How does this technology work? Image generation today is mostly done via a method called a diffusion model. Diffusion models have learned from huge numbers of images found on the Internet or elsewhere. It turns out that at the heart of a diffusion model is supervised learning. Here's what it does. Let's say the algorithm finds a picture on the Internet of an apple like this, and it wants to learn from pictures like this, and hundreds of millions of others, how to generate images. The first step is to take this image and gradually add more and more noise to it. You go from this nice picture of an apple, to a noisier one, to an even noisier one, to finally a picture that looks like pure noise, where all the pixels are chosen at random and it doesn't look at all like an apple. The diffusion model then uses pictures like these as data to learn, using supervised learning, to take as input a noisy image and output a slightly less noisy image. Specifically, it would create a dataset where the first data point says: given the second input image, we want the supervised learning algorithm to learn to output a cleaner version of this apple. 
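To make this concrete, here is a toy numpy sketch of the idea: mix an image with progressively more random noise, then pair each noisier image with the slightly cleaner one before it as (input, target) training examples. The 8x8 grayscale "apple", the four steps, and the simple linear mixing rule are all illustrative assumptions for this sketch, not the exact recipe any real diffusion model uses.

```python
import numpy as np

def add_noise_steps(image, num_steps=4, rng=None):
    """Gradually mix an image with random noise, returning a list that
    goes from the clean image to (nearly) pure noise."""
    rng = rng or np.random.default_rng(0)
    noisy_images = [image]
    for step in range(1, num_steps + 1):
        alpha = 1.0 - step / num_steps      # fraction of the original image kept
        noise = rng.random(image.shape)     # random pixel values in [0, 1)
        noisy_images.append(alpha * image + (1.0 - alpha) * noise)
    return noisy_images

def make_training_pairs(image, num_steps=4):
    """Each training example maps a noisier image (the input) to the
    slightly less noisy image one step earlier (the target)."""
    seq = add_noise_steps(image, num_steps)
    return [(seq[i + 1], seq[i]) for i in range(num_steps)]

# Toy 8x8 grayscale "apple": a bright square on a dark background.
apple = np.zeros((8, 8))
apple[2:6, 2:6] = 1.0
pairs = make_training_pairs(apple, num_steps=4)
```

A real system would repeat this over hundreds of millions of images and train a large neural network on the resulting pairs; the sketch only shows how the dataset is constructed.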
Here's another data point: given this third, even noisier image, we would like the algorithm to learn to output a slightly less noisy version like this. Finally, given an image of pure noise like this fourth image, we would like it to learn to output a slightly less noisy picture that suggests the presence of an apple. After training on maybe hundreds of millions of images via a process like this, when you want to apply it to generate a new image, this is how you would run it. You start off with a pure noise image. Start by taking a picture where every single pixel is chosen completely at random. We then feed this picture to the supervised learning algorithm that we trained on the previous slide. When we feed in pure noise, it has learned to remove a little bit of noise from the picture, and you may end up with a picture like this that suggests some sort of fruit in the middle, but we're not quite sure what it is yet. Given this second picture, we again feed it to the model, and it then removes a little bit more noise, and now it looks like we can see a noisy picture of a watermelon. Then if you apply this one more time, we end up with this fourth image, which looks like a pretty nice picture of a watermelon. I'm illustrating this process using four steps of adding noise on the previous slide, and four steps of removing noise on this slide, but in practice, maybe about 100 steps would be more typical for a diffusion model. This algorithm works for generating pictures completely at random. But we want to be able to control the image it generates by specifying a prompt to tell it what we want it to generate. Let me describe a modification of the algorithm that lets you add text, or a prompt, to tell it what you want it to generate. In this training data, we're given pictures like this apple, as well as a description or a prompt that could have generated this apple. 
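The generation loop described above, start from pure noise and repeatedly apply the learned denoiser, can be sketched in a few lines. Since we have no trained network here, the `fake_denoiser` below is a stand-in (it just nudges pixels toward a fixed target image); in a real diffusion model, that function would be the trained neural network.

```python
import numpy as np

def sample_image(denoise_fn, shape=(8, 8), num_steps=100, rng=None):
    """Start from pure noise and repeatedly apply the denoising model
    to produce an image, as a diffusion model does at generation time."""
    rng = rng or np.random.default_rng(0)
    image = rng.random(shape)        # every single pixel chosen at random
    for _ in range(num_steps):
        image = denoise_fn(image)    # each call removes a little noise
    return image

# Stand-in for a trained model: nudges every pixel 10% of the way
# toward a target image. A real denoiser is a trained neural network.
target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0
fake_denoiser = lambda img: img + 0.1 * (target - img)

result = sample_image(fake_denoiser, num_steps=100)
```

Note the step count matches the transcript's point: four steps are shown on the slides for illustration, but something on the order of 100 denoising steps is more typical in practice.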
Here, I have a text description saying this is a red apple. Then, same as before, we add noise to this picture until we get the fourth image, which is pure noise. But we're going to change how we build the learning algorithm: rather than inputting the slightly noisy picture and expecting it to generate a clean picture, we'll instead have the input A to the supervised learning algorithm be this noisy picture, as well as the text caption or prompt that could have generated this picture, namely "red apple". Given this input, we want the algorithm to output this clean picture of an apple. Similarly, we'll generate additional data points for the algorithm using the other noisy images, where each time, given a noisy image and the text prompt "red apple", we want the algorithm to learn to generate a less noisy picture of a red apple. Having learned from a large dataset, when you want to apply this algorithm to generating, say, a green banana, this is what you do. Same as before, we start off with an image of pure noise, where every single pixel is chosen completely at random. If you want to generate a green banana, you input to the supervised learning algorithm that picture of pure noise, together with the prompt "green banana". Now that it knows you want a green banana, hopefully the algorithm will output a picture that maybe looks like this. You can't see the banana that clearly, but maybe there's a suggestion of some greenish fruit in the middle, and this is the first step of image generation. The next step is, we take this image on the right, which was the output B, and feed it as the input A, again with the prompt "green banana", to get the model to generate a slightly less noisy picture, and now it looks like there's clearly a green banana, but a pretty noisy one. We do this one more time, and it finally removes most of the noise, until we end up with this picture of a pretty nice green banana. That's how diffusion models work for generating images. 
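The only change for prompt-conditioned generation is that the denoiser now takes two inputs, the noisy image A and the text prompt, and the same prompt is fed in at every denoising step. In this toy sketch, the "trained model" is faked with a lookup table of target images per prompt, which is purely an illustrative assumption; a real system uses one neural network that has learned from millions of captioned images.

```python
import numpy as np

# Toy 8x8 target images standing in for what the model learned to draw.
apple = np.zeros((8, 8))
apple[2:6, 2:6] = 1.0
banana = np.zeros((8, 8))
banana[1:7, 3:5] = 1.0
TARGETS = {"red apple": apple, "green banana": banana}

def denoise(image, prompt):
    """Fake prompt-conditioned denoiser: input A is (noisy image, prompt),
    output B is a slightly less noisy image steered toward the prompt."""
    target = TARGETS[prompt]
    return image + 0.1 * (target - image)

def generate(prompt, num_steps=100, rng=None):
    """Start from pure noise; the prompt steers every denoising step."""
    rng = rng or np.random.default_rng(0)
    image = rng.random((8, 8))              # pure noise to start
    for _ in range(num_steps):
        image = denoise(image, prompt)      # same prompt at each step
    return image
```

The same starting noise produces a different image depending on the prompt, which is exactly the control the modified algorithm buys you.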
At the heart of this magical process of generating beautiful images is, again, supervised learning. Thanks for sticking with me for this optional video, and I look forward to seeing you next week, where we'll dive much more into applications being built using generative AI. I'll see you in the next video.