Wooz Blog


Generative AI - Part.1 🤖

Kwanwoo · Wed Jul 12 2023

💭 Introduction

The pace of advancement in generative AI, a technology that has been gaining attention recently, is nothing short of astonishing. It can generate high-quality data in many forms: not just visual data, but also text and audio. Moreover, it is possible to input text and output video or audio in a desired format, to generate natural motions for robots or game characters, or to edit existing content based on text.
  • Reference 1: Example of a video generated by inputting text (link)
  • Reference 2: Example of a voice generated by inputting text (link)
  • Reference 3: Example of a motion generated by inputting text (link)
  • Reference 4: Example of editing motion with text (link)
So, on what principle do generative models, the core of generative AI, learn and operate? And what does 'generation' mean? In statistics and machine learning, generation is defined as sampling from a probability distribution defined on the data space, producing values that were not directly observed during training.
The sudden mention of probability distribution and sampling may confuse some readers. Here, sampling refers to the process of generating numbers via an algorithm to follow a specific probability distribution, not the process of choosing from given data. A computer recognizes and outputs data as numbers. The videos or texts we see through a monitor are representations decoded from numbers output by a computer so that humans can understand. If AI produces videos or texts that seem real or as if a human had written them, it means the distribution of the values output by the generative model and the distribution of the real encoded data have very similar patterns, making it statistically hard to distinguish between the two.
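To make "sampling as number generation" concrete, here is a minimal sketch (using numpy, which the article itself does not reference; the mixture weights, means, and scales are illustrative choices): an algorithm produces numbers that follow a chosen probability distribution without ever selecting from observed data.

```python
import numpy as np

# Ancestral sampling from a two-component Gaussian mixture: first pick a
# component, then draw from it. No observed dataset is involved; the
# numbers are generated so that they follow the target distribution.
rng = np.random.default_rng(0)

def sample_mixture(n, weights=(0.3, 0.7), means=(-2.0, 3.0), scales=(0.5, 1.0)):
    comps = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(np.take(means, comps), np.take(scales, comps))

samples = sample_mixture(100_000)  # mixture mean is 0.3*(-2) + 0.7*3 = 1.5
```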
Figure 1: Generative models and sampling. The goal of the generative model is to sample from data distribution (link)
If the shape of the data distribution is relatively simple, like a normal distribution, and its parameters are known, then sampling is not difficult. However, the probability distribution of real data such as images or text lives in a high-dimensional space, so its shape is extremely complex. Estimating meaningful parameters and generating data that resembles real data are therefore both challenging problems, and sampling widely and variably across the region where the distribution is defined (its support) is an even harder one.
The introduction of GAN [1] in 2014 marked the beginning of intense interest in generative model research. GAN was inspired by the idea of learning a generative model via a two-player game: it trains through a repeated min-max optimization between a generator, which creates data similar to real data, and a discriminator, which distinguishes between the two. The quality of the images generated by GANs was so good that the method quickly drew attention from AI developers. The image below shows the astonishing rate of progress over a short period of 7 years.
Figure 2: Yearly advancement of the quality of images generated by GANs (link)
GAN is an excellent generative model, but it has a few downsides. One is the difficulty of sampling data that is both high-quality and diverse (known as the fidelity-diversity tradeoff); another is that, regardless of the model's performance, it is hard to know what the probability distribution looks like. GAN is purely a sampling methodology, so other methods must be used when probability distribution estimation is needed.
So, is there a generative model that can generate high-quality data like GAN while also estimating the probability distribution? Models such as the VAE (Variational AutoEncoder) [2] and flow models [3], which emerged around the same time as GAN, have made dazzling advances and have long been noted by generative model researchers. Among them, the sudden rise of the Diffusion Model has drawn significant attention from AI researchers and developers, and there has recently been explosive growth in companies releasing generative AI services based on Diffusion Models, such as Stable Diffusion [4] and DALL-E [5,6].
Figure 4: Image created by Karlo, developed by Kakao Brain (photo: Kakao Brain)
While preparing this article, the hardest decision was choosing the target audience for an introduction to the learning principle of the Diffusion Model. After much consideration, I decided to write for undergraduate or new graduate students interested in how probability and statistics are used in generative AI research. Some aspects are glossed over in the name of intuition. In fact, to properly understand the background of the Diffusion Model, a solid foundation in probability theory and stochastic differential equations is required; readers already familiar with this material may kindly skip Part 1.

🤨 So, what is the learning principle of the Diffusion Model?

The ultimate goal of a generative model, as explained above, is to sample data from a probability distribution defined on the data space. In other words, it can be understood as training a machine learning model that maps from a noise space to a data space. To elaborate in probabilistic terms, we want to find a function G_θ : Z → X that maps a random variable z defined on the noise space Z to a random variable x following the data distribution defined on the data space X. Here, θ represents the parameters of the machine learning model.
Figure 5: The main goal of generative model training is to find a function that maps from the noise space to the data space.
The properties that a machine learning model must satisfy in the generation problem can be written as follows:
  1. The sampled data should be of similar quality to the actual data. That is, data generated by the model should exhibit patterns actually observable in the data distribution. More precisely, the probability that the model's output belongs to the support of the data distribution should be high (note: some papers conflate this with the term "manifold," but "support" is the more accurate term).
  2. Sampling should be possible across the entire data space. For example, if infinitely many random variables drawn from the noise space are fed into the model, the outputs should cover most of the support of the data distribution. In mathematical terms, the support of the model's output distribution should contain the support of the data distribution, supp(p_data) ⊆ supp(p_{G_θ}), and furthermore no mode collapse should occur.
These two properties correspond to fidelity and diversity, the criteria used to evaluate generative models. If the first condition is not met, it is hard to say that the generated data was sampled from the data distribution: fidelity decreases, and repeated generation plus cherry-picking is required to obtain usable samples. If the second condition is not met, the data that can be sampled is limited to certain regions, which suggests the model has learned biased patterns or spurious correlations.
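As a toy illustration of these two properties (a sketch under simplifying assumptions, not any model from the article), consider a "generator" that is just an affine map: it pushes standard normal noise onto a target normal distribution, so its samples both land in high-density regions of the target and cover its whole support.

```python
import numpy as np

# A perfect toy generator: G maps noise z ~ N(0, 1) onto the "data
# distribution" N(5, 2^2). Its outputs fall in high-density regions of
# the target (fidelity) and spread over the whole support (diversity).
rng = np.random.default_rng(1)

def G(z, mu=5.0, sigma=2.0):
    return mu + sigma * z  # pushforward of N(0, 1) is N(mu, sigma^2)

z = rng.standard_normal(200_000)
x = G(z)
```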
One might wonder if the GAN[1] or VAE[2,3] based generative models that machine learning researchers have been studying for the past decade meet these conditions. Given the fast pace of deep learning-based generative model research and the amazing progress of recent research on both models[8,9,10], it's cautious to make a simple comparison. However, from the perspective of machine learning principles, it's quite a challenge to satisfy both properties at the same time. In the case of GANs, adversarial training is used to guide the generative model's learning so that the first condition is met. However, adversarial training interferes with the generative model learning to cover the entire data space, leading to the challenging issue of mode collapse, and thus it is difficult to achieve the second condition. VAEs use an encoder, which compresses feature information extracted from training data into a probability distribution in the latent space, and a decoder, which restores data from the compressed information, to train the generative model. Unlike adversarial training, it's possible to satisfy the second condition by optimizing the likelihood of the data distribution. However, there are limitations to the decoder learning to generate high-dimensional data such as images from the compressed latent space using only likelihood optimization[8,9].
How can we train a generative model that satisfies both of these conditions? The Diffusion Model introduced in this article takes a different approach to the two problems. First, adding noise to data creates a version of the original information that is partially corrupted.
To restore the original data from the corrupted data, one can estimate the size of the noise and subtract it from the corrupted data. This is referred to as denoising. For example, as in Figure 6 below, it can be framed as the problem of estimating how much noise has been added to each pixel of a noisy image.
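The corruption-and-restoration idea can be sketched in a few lines (numpy-based; the 1-D "image" and noise scale are illustrative): add noise of a known scale, then subtract an estimate of that noise to recover the original.

```python
import numpy as np

# Noise injection and (idealized) denoising on a 1-D "image".
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 8)      # clean pixel values
sigma = 0.1
eps = rng.normal(0.0, sigma, size=x.shape)
x_noisy = x + eps                 # corrupted observation
x_restored = x_noisy - eps        # denoising with the true noise; a real
                                  # model would estimate eps instead
```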
Figure 6: Machine learning model performing denoising from an image with added noise
Let's call the machine learning model that learns to denoise data with added noise ε_θ. It takes as input the noisy data x̃ = x + ε, obtained by adding noise ε to the original data x, estimates how much noise was added, and outputs a restored image x̂ = x̃ − ε_θ(x̃).
As in Figure 6, if only weak noise is added to the data, it may be possible to estimate the original. As stronger noise is added, however, accurately recovering the corrupted information becomes difficult. But our goal is generation, not memorization: we do not need to restore every piece of information in the training data (that could even be a problem for data security or privacy!). Instead, we want a model that, aided by probabilistic variability, learns patterns statistically similar to those seen in real data. The methodology inspired by this idea is the DAE (Denoising AutoEncoder). The DAE was designed as a replacement for the RBM (Restricted Boltzmann Machine) studied by Geoffrey Hinton at the time: through the noise injection process it learns to extract robust patterns from data rather than fragile details, which makes it useful for pre-training and unsupervised learning.
So what information about the data distribution does the model learn through the denoising process? Interestingly, the DAE is known to be equivalent, from an optimization perspective, to the SM (Score Matching) technique proposed by Aapo Hyvärinen. The score function (note: also called the Stein score function in honor of the statistician Charles M. Stein) is defined as the gradient of the logarithm of the probability density function, s(x) = ∇_x log p(x).
The SM technique proposes an objective function for matching a machine learning model to the score function of the data distribution. It was originally used to estimate the parameters of statistical models whose normalizing constants are intractable, and was designed to complement the shortcomings of MCMC (Markov Chain Monte Carlo) sampling. The denoising process introduced above can be connected to SM as follows. Given the original data x, the noisy data x̃ = x + ε follows a conditional distribution with density p(x̃ | x). If the noise ε follows a Gaussian distribution with mean 0 and variance σ², the score function of this conditional distribution is ∇_x̃ log p(x̃ | x) = −(x̃ − x)/σ² = −ε/σ².
If optimization proceeds so that the model learns the score function of this conditional distribution, then given noisy data x̃ it estimates the injected noise scaled by the variance, −ε/σ², and thereby learns the path from the corrupted information x̃ back to the original data x. This is DSM (Denoising Score Matching), based on the denoising technique, and Pascal Vincent proved that the objective of the original SM technique is equivalent to that of DSM, and that the DSM objective is in turn equivalent to that of the DAE. In other words, because the DAE and SM objectives are equivalent, optimization eventually yields the same parameters. In conclusion, through the denoising process, the machine learning model learns the score function of the data distribution.
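The conditional score used in DSM can be checked numerically (a small sketch assuming Gaussian noise; the specific numbers are arbitrary): for p(x_noisy | x) = N(x, sigma^2 I), the score is -(x_noisy - x) / sigma^2, i.e. the negative injected noise scaled by the variance.

```python
import numpy as np

# Verify the analytic conditional score against a finite-difference
# gradient of the Gaussian log-density (up to its normalizing constant).
def log_density(xt, x, sigma):
    return -0.5 * np.sum((xt - x) ** 2) / sigma**2

x, sigma = np.array([0.3, -1.2]), 0.5
eps = np.array([0.05, -0.1])
xt = x + eps                       # noisy observation

analytic = -eps / sigma**2         # -(xt - x) / sigma^2
h = 1e-6
numeric = np.array([
    (log_density(xt + h * np.eye(2)[i], x, sigma)
     - log_density(xt - h * np.eye(2)[i], x, sigma)) / (2 * h)
    for i in range(2)
])
```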
The patterns learned through the denoising method above help a generative model mimic the data distribution, but they are not sufficient for generating high-quality data directly from the noise space: once the original information is completely destroyed, the output of a model trained to denoise in a single step is no different from noise. What if, instead, we train a model that denoises step by step, gradually varying the amount of noise added at each step? The key idea of the Diffusion Model is to divide the noise-injection and denoising processes into stages, as in Figure 7, and train a generative model on those stages. Several researchers proposed similar ideas, but the paper that first showed convincing results in image generation was the DDPM (Denoising Diffusion Probabilistic Model) of Ho et al. If you are curious about the theoretical origins of the Diffusion Model, I also recommend the paper by Sohl-Dickstein et al.
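The staged noise injection (the forward process) can be sketched as follows; the linear variance schedule mirrors common DDPM settings, but the exact values here are illustrative, not taken from the article.

```python
import numpy as np

# DDPM-style forward process: at each step t,
#   x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps_t.
# After enough steps, x_T is nearly standard Gaussian noise regardless
# of the starting data.
rng = np.random.default_rng(3)

def forward_diffuse(x0, betas):
    x = np.asarray(x0, dtype=float).copy()
    for beta in betas:
        eps = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps
    return x

x0 = np.ones(10_000)                   # a constant "image"
betas = np.linspace(1e-4, 0.02, 1000)  # variance schedule
xT = forward_diffuse(x0, betas)        # close to N(0, 1) samples
```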
Figure 7: The process of adding noise to data (forward) vs. denoising (backward)

🤖 Diffusion Model approximates the solution of a stochastic differential equation

What is the relationship between a generative model that learns denoising and a stochastic differential equation? Let's sequentially add Gaussian noise ε_t ~ N(0, σ_t² I), controlled by the size of the variance σ_t², to the training data.
Now assume a machine learning model that learns to denoise by estimating the noise injected at each step. To sample with a denoising model, as shown in Figure 7, denoising must be performed in reverse order, like a time traveler, starting from the final destination, the noise space, and ending at the starting point, the data space. During this process, the Diffusion Model adds fresh Gaussian noise of scale σ_t to the value output by the denoising model at each step to provide probabilistic variability, and uses the result as the input for the next step (note: even without this fresh noise injection, if the denoising model is well trained, the sampling algorithm still works, and if the same initial random number is fed in, the model's output will also be identical. This is an interesting topic in its own right, but I omit the details here; if you are curious, look up the Probability Flow ODE in the Song et al. [21] paper).
As explained above, the generative model that learns denoising has an objective equivalent to the DSM technique, which learns the score function of the conditional distribution. Let's call the model learned by DSM a score model and denote it s_θ(x, t). Using the relationship between the denoising model and the score model, s_θ(x, t) = −ε_θ(x, t)/σ_t, the sampling update above can be rearranged into the form of an Euler-Maruyama approximation.
Here, using the equivalence of the DSM and SM techniques, the target that the score model approximates through learning is ultimately the score function of the data distribution, ∇_x log p_data(x). The sampling algorithm of the Diffusion Model therefore approximates the LMC (Langevin Monte Carlo) algorithm, which computes a numerical solution of a stochastic differential equation, the Langevin equation!
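As a minimal sketch of LMC (assuming the score is known analytically, where in practice a trained score model would supply it), the update x_{k+1} = x_k + (eta/2) * score(x_k) + sqrt(eta) * z_k drives samples toward the target distribution; here the target is N(2, 1), whose score is -(x - 2).

```python
import numpy as np

# Unadjusted Langevin dynamics with the analytic score of N(2, 1).
rng = np.random.default_rng(4)

def lmc(score, x0, eta=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x = x + 0.5 * eta * score(x) + np.sqrt(eta) * rng.standard_normal(x.shape)
    return x

# Start many independent chains from standard normal noise; they
# converge toward samples of N(2, 1).
x = lmc(lambda x: -(x - 2.0), rng.standard_normal(50_000))
```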
The distribution sampled sequentially through the LMC algorithm approaches the data distribution as the step size and the noise scale approach 0, under certain conditions [20]. For example, if the distance between distributions is measured by KL divergence (Kullback-Leibler divergence), the objective used in the Diffusion Model drives the KL divergence between the sampled distribution and the data distribution toward zero.
In fact, under certain conditions imposed on the data distribution, this convergence can be shown theoretically [21] (note: if the data lie on a low-dimensional manifold (the manifold hypothesis), the KL divergence is replaced by the Wasserstein distance [22]). This naturally leads to the two properties, fidelity and diversity, that we examined earlier. If samples were repeatedly drawn from outside the support of the data distribution, the KL divergence would become infinite, contradicting the convergence above. Therefore, as the sampling algorithm proceeds, the probability of generating samples outside the support of the data distribution converges to 0, ensuring fidelity. Also, because of the probabilistic variability added during sampling, the sampled distribution assigns probability across the support of the data distribution; for the convergence condition to hold, the region where the sampled distribution falls far below the data distribution must shrink, which means learning proceeds in a way that ensures diversity. In summary, the Diffusion Model's training naturally induces sampling evenly across the region where the data distribution is defined while maintaining quality similar to real data.
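For intuition about the KL divergence being driven to zero, the closed form for two univariate Gaussians is handy (a standard formula, not specific to this article): it is zero exactly when the two distributions coincide, and grows as they separate.

```python
import numpy as np

# KL(N(m1, s1^2) || N(m2, s2^2))
#   = log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 * s2^2) - 1/2
def kl_gauss(m1, s1, m2, s2):
    return np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

same = kl_gauss(0.0, 1.0, 0.0, 1.0)     # identical distributions: 0
shifted = kl_gauss(1.0, 1.0, 0.0, 1.0)  # unit mean shift: 0.5
```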
Figure 8: Comparison between generative models GAN, VAE, Diffusion Model (source: Nvidia, link)
So far, I have explained the basic learning principle of the Diffusion Model and its connection to the Langevin stochastic differential equation, and through this examined its advantages as a generative model. In practice, additional engineering is needed to generate high-quality, diverse data; the theory of stochastic differential equations alone cannot guarantee it. If you are curious about these aspects, I recommend reading the LDM (Latent Diffusion Model) [4] and Guided Diffusion [7] papers.
 

© 2023 Wooz Labs.