
Generative AI - Part.2 🤖

Kwanwoo · Thu Jul 13 2023

💭 Introduction

In the first part of our discussion, we examined the basic learning principles of Diffusion Models and their connection to Langevin stochastic differential equations. We considered a generative model that sequentially adds Gaussian noise with a scheduled variance to the training data and learns to denoise it, introducing a machine learning model that learns denoising by estimating the noise injected at each step. We also noted that the Diffusion Model injects stochastic variability by adding Gaussian noise to the output of the denoising model at each sampling step.
Moreover, we explained the equivalence of this model to the Denoising Score Matching (DSM) technique, which learns the Score function of a conditional probability distribution, and showed that the denoising model essentially approximates the Score function of the data distribution through the Score model learned by DSM. As a result, the sampling algorithm of the Diffusion Model approximates the Langevin Monte Carlo (LMC) algorithm, which computes a numerical solution of the Langevin equation.
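As a reminder, the LMC update referenced here takes the standard form (the step size $\epsilon$ is illustrative):

$$x_{k+1} = x_k + \frac{\epsilon}{2}\, \nabla_x \log p(x_k) + \sqrt{\epsilon}\, z_k, \qquad z_k \sim \mathcal{N}(0, I),$$

which is exactly the Euler–Maruyama discretization of the Langevin equation, with the Score function $\nabla_x \log p$ appearing in the drift.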
We also pointed out that, under certain conditions, the sequential sampling process of the LMC algorithm converges to the data distribution, which guarantees both the fidelity and the diversity of the model. Finally, we briefly compared the Diffusion Model with other generative models such as GANs and VAEs in terms of their ability to generate high-quality data.

🎯 Score-Based Generative Models and Stochastic Differential Equations

The learning and generation principles of Diffusion Models, which model multi-layer transformations and inverse transformations between the data distribution and a noise distribution, share many similarities with Flow Models. The difference is that Flow Models learn deterministic functions for the transformations between layers and use these functions to map random values drawn from the noise distribution into the space where the data are defined, thereby generating data. Both families of generative models use reversible transformations with the same goal of sampling from the data distribution, but the meaning and use of reversibility differ between deterministic functions and stochastic processes. A deterministic function must be bijective (one-to-one and onto) so that the inverse transformation between input and output is well-defined; hence the types of functions that can be used in each layer of a Flow Model are relatively limited.
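For reference, the bijectivity requirement comes from the change-of-variables formula that Flow Models rely on to evaluate the data likelihood (written here in its standard form, not tied to any particular model in this post):

$$\log p_X(x) = \log p_Z\big(f(x)\big) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|,$$

where $f$ is the invertible map from the data space to the base (noise) distribution; without a well-defined inverse and Jacobian, this density cannot be evaluated.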
Figure 9: Comparison of the sampling process of Flow Models and Diffusion Models
On the other hand, in Diffusion Models a stochastic process is considered reversible if it can restore the initial data distribution. If the transition probability distribution of the reverse process, i.e., the forward process with the time axis reversed, can be estimated from the transition probability distribution of the forward process, then a reversible transformation is achieved and it becomes possible to sample from the data distribution through the reverse process. This is referred to as time reversal of the stochastic process. Learning a Diffusion Model amounts to estimating the transition probability distribution of the reverse process, which denoises the stochastic path of the forward process that injects Gaussian noise [18,19]. Therefore, unlike in Flow Models, the generation process of a Diffusion Model involves sampling with stochastic variability.

⏳ Time Reversal of Stochastic Processes and Score-Based Generative Models

Remember that the noise injected in the sampling process of the Diffusion Model is a mechanism needed to prevent mode collapse and ensure diversity. Interestingly, the sampling process that denoises with this stochastic variability realizes the time reversal of the forward process that injects noise (note: as mentioned earlier, it is also possible to sample from the data distribution without injecting noise, by following the mean path of the reverse process [23]). In general, time reversal is not possible for an arbitrary stochastic process. However, according to the time reversal theorem, if the reverse process of a Markov stochastic process exists, it is again a Markov stochastic process, and the essential information connecting the two Markov processes is the Score function. The reason time reversal is possible for a Diffusion Model is precisely that its objective function, derived by training a machine learning model to denoise, learns the Score information of the data distribution [15]. Therefore, a denoising-based Diffusion Model can also be placed within the framework of Score-Based Generative Models (SGMs) [23].
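For reference, the denoising objective alluded to here can be written as the standard DSM loss (a sketch in the spirit of [15,18]; time-dependent weighting is omitted):

$$\mathcal{L}_{\mathrm{DSM}}(\theta) = \mathbb{E}_{t,\, x_0,\, x_t \sim p(x_t \mid x_0)} \Big[ \big\| s_\theta(x_t, t) - \nabla_{x_t} \log p(x_t \mid x_0) \big\|^2 \Big].$$

Minimizing this objective drives the Score model $s_\theta(x_t, t)$ toward $\nabla_{x_t} \log p_t(x_t)$, the Score of the marginal distribution, which is exactly the information needed for time reversal.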
The previously introduced denoising-based DDPM [18] assumes that both the data space and the noise space are Euclidean. The time reversal theorem, however, can also be applied to discrete state spaces [24] or Riemannian manifolds [25], and hence to a wide range of data. For readers interested in how the theory of Score-based generative models unfolds for Markov stochastic processes defined on a general state space, I recommend Benton et al. [26]. This part of the story begins with the work of Song et al. [23], who used the time reversal theorem in Euclidean space.
Song et al. [23] applied the time-reversal formula [27,28] for stochastic differential equations as follows, extending the discrete-time Diffusion Model to a continuous-time SGM and proposing a learning method that can control the amount of noise injected into the data over time. In Videos 10 and 11, you can see how noise injection and generation proceed along the paths of the Forward and Reverse stochastic differential equations.
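The equations referred to here are the Forward and Reverse SDEs of Song et al. [23], reproduced in their general form (the drift $f$ and diffusion coefficient $g(t)$ determine the particular noise schedule):

$$\text{Forward SDE:}\quad dx = f(x, t)\, dt + g(t)\, dw$$

$$\text{Reverse SDE:}\quad dx = \left[ f(x, t) - g(t)^2\, \nabla_x \log p_t(x) \right] dt + g(t)\, d\bar{w}$$

Here $w$ and $\bar{w}$ are standard Wiener processes running forward and backward in time, and $p_t$ denotes the marginal distribution of $x$ at time $t$.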
Looking at the drift term of the Reverse SDE above, we see a time-dependent Score function. To perform time reversal on a stochastic differential equation with time-dependent coefficients, we need the Score function of the marginal distribution $p_t(x)$ at every time $t$, not just the Score function of the data distribution as in the Langevin equation. Therefore, Score-based generative models train a Score model to approximate the Score function of $p_t(x)$ at each time $t$, and this model is substituted into the drift term of the Reverse SDE when sampling.
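As an illustration, a minimal Euler–Maruyama sampler for the Reverse SDE might look like the sketch below. The variance-exploding schedule, the toy score model, and all hyperparameters are illustrative assumptions, not the exact setup of [23]:

```python
import numpy as np

def reverse_sde_sample(score_model, dim, n_steps=1000, T=1.0,
                       sigma_min=0.01, sigma_max=50.0):
    """Euler-Maruyama discretization of the Reverse SDE.

    Assumes a variance-exploding forward SDE dx = g(t) dw with
    sigma(t) = sigma_min * (sigma_max / sigma_min) ** t and
    g(t)^2 = d[sigma(t)^2]/dt. `score_model(x, t)` is assumed to
    approximate the time-dependent Score, grad_x log p_t(x).
    """
    dt = T / n_steps
    x = sigma_max * np.random.randn(dim)  # sample from the approximate prior
    for i in range(n_steps, 0, -1):
        t = i * dt
        sigma_t = sigma_min * (sigma_max / sigma_min) ** t
        g2 = 2 * sigma_t ** 2 * np.log(sigma_max / sigma_min)  # g(t)^2
        # One reverse step from t to t - dt: the drift uses the learned Score,
        # and fresh Gaussian noise provides the stochastic variability.
        x = x + g2 * score_model(x, t) * dt + np.sqrt(g2 * dt) * np.random.randn(dim)
    return x

if __name__ == "__main__":
    # Purely illustrative Score: if the data are N(0, I), then under this
    # schedule p_t = N(0, (1 + sigma(t)^2) I) and the Score is closed-form.
    def toy_score(x, t):
        sigma_t = 0.01 * (50.0 / 0.01) ** t
        return -x / (1.0 + sigma_t ** 2)

    print(reverse_sde_sample(toy_score, dim=2))
```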
Video 10: The path of the stochastic process where noise is injected into the data by the Forward stochastic differential equation
Video 11: The path of the stochastic process where data is generated from noise by the Reverse stochastic differential equation

🖼️ Is time reversal possible with a stochastic process other than de-noising?

Looking at the Forward formulation proposed by DDPM [18] or Song et al. [23], it is clear that a Gaussian stochastic process plays the central role. With the attention garnered by Diffusion Model and SGM research, researchers in generative modeling began asking the following question:
"Do we necessarily need Gaussian noise in the learning of Diffusion Model or SGM?"
As explained previously, the time reversal of a stochastic process is not limited to Gaussian stochastic processes; for instance, there are SGM studies based on discontinuous Lévy processes [29]. But is it possible to design the stochastic process in a way other than de-noising and still make learning possible? Studies originating from this question were recently presented at ICLR, one of the top conferences in the field of machine learning.
Figure 12: Comparison of the Diffusion Model and IHDM [30]
In the Inverse Heat Dissipation Model (IHDM) [30], a learning method for generative models was proposed that uses the heat equation instead of a Gaussian noise-injecting stochastic process. Figure 12 shows the difference between the stochastic processes used by the Diffusion Model and by IHDM. The noise injection used in the Diffusion Model destroys the statistical correlations between pixels, whereas the heat equation uses information from adjacent pixels to produce a smooth result, diffusing locally concentrated information over the whole space. Figure 13 shows how the features of the objects in the image are gradually blurred by the heat equation. The space formed by diffusing the data information in this way is called the prior space, to distinguish it from the noise space.
Figure 13: The process of training a generative model using the heat equation and deblurring
How can one design generative-model learning that maps from the prior space to the data space? A natural idea is to have a machine learning model learn deblurring, as shown in Figure 13. However, the inverse problem of the heat equation is fundamentally ill-posed, so learning to deblur alone is not enough to generate samples of the same quality as real data. In the actual IHDM experiments, Gaussian noise is added on top of the heat equation in both the training and sampling procedures, so learning proceeds by mixing de-noising and de-blurring (a small sketch of this combined objective is given below). But as mentioned earlier, de-noising-based DSM induces learning of the Score function. So what information about the data does a generative model like IHDM, which learns de-blurring and de-noising together, end up learning in order to make sampling possible? Surprisingly, it also ends up learning the Score information of the data! How is this possible, when de-noising and de-blurring are completely different concepts?
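Below is a minimal sketch of what this mixed objective can look like, assuming a periodic-boundary heat-equation blur implemented in Fourier space and a simple regression target from a noisier, more-blurred image to a less-blurred one. The function names, the boundary condition, and the noise level are illustrative assumptions rather than the exact setup of [30], which uses a cosine basis (Neumann boundaries):

```python
import numpy as np

def heat_blur(u, t):
    """Evolve the heat equation du/dt = laplacian(u) up to time t with
    periodic boundaries, using the Fourier multiplier exp(-t * |k|^2)."""
    h, w = u.shape
    ky = 2 * np.pi * np.fft.fftfreq(h)
    kx = 2 * np.pi * np.fft.fftfreq(w)
    k2 = ky[:, None] ** 2 + kx[None, :] ** 2
    return np.real(np.fft.ifft2(np.fft.fft2(u) * np.exp(-t * k2)))

def training_pair(x0, t, dt=0.05, sigma=0.01, rng=np.random):
    """Build one (input, target) pair that mixes de-blurring and de-noising:
    the input is the image blurred to time t plus Gaussian noise, and the
    target is the slightly less blurred image at time t - dt."""
    u_t = heat_blur(x0, t) + sigma * rng.randn(*x0.shape)  # noisy, more blurred
    u_prev = heat_blur(x0, max(t - dt, 0.0))               # less blurred target
    return u_t, u_prev

# A model trained to map (u_t, t) -> u_prev must simultaneously undo one step
# of the heat semigroup (de-blurring) and remove the Gaussian perturbation
# (de-noising) -- the combination discussed in the text.
```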
Actually, this is because the nature of the data space used in the Diffusion Model differs from the data space defined in IHDM. First, the heat equation requires the Laplace operator Δ to be defined, so in IHDM data are treated as functions, making the data space a function space. The grid points of the function's domain correspond to pixel positions, and the image data we actually observe are regarded as projections of this function onto a basis of a finite-dimensional space. Therefore, the learning principle of IHDM cannot be fully explained by the time reversal theory of stochastic differential equations defined in finite-dimensional spaces.
To answer this question, one must turn to the theory of stochastic partial differential equations (SPDEs). SPDEs have been studied steadily since the mid-20th century in control and filtering theory in electrical engineering. Since the start of the 21st century they have been actively applied across many fields of science and engineering, and their contribution to statistical physics was recognized when Martin Hairer received the Fields Medal in 2014.
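The equation discussed next is the stochastic heat equation; a minimal sketch of it, in the simplified notation used in this post (the exact coefficients in [31,33] may differ), is

$$dX_t = \Delta X_t\, dt + g(t)\, dW_t,$$

where $\Delta$ is the Laplace operator, $g$ is a time-dependent diffusion coefficient, and $W_t$ is a Wiener process on the function space.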
The equation above is a kind of SPDE, a stochastic heat equation, which is a form in which white noise is added to the diffusion term of the heat equation (Note: To discuss the regularity of the solution, it is actually better to use colored noise rather than white noise. For a detailed discussion on this topic, we recommend references [31,33]). If the diffusion coefficient depends only on the time variable t and not on the space variable x, the solution of the equation can be expressed in the following form [31].
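In mild-solution form (a sketch following [31], under the assumption of a purely time-dependent diffusion coefficient $g(t)$):

$$X_t = e^{t\Delta} X_0 + \int_0^t e^{(t-s)\Delta}\, g(s)\, dW_s.$$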
Here, $e^{t\Delta}$ is the exponential of the Laplace operator (the heat semigroup), which outputs the solution of the heat equation at time $t$ when the initial condition is given as input. So what does the reverse stochastic process of such a stochastic heat equation look like? To derive it, one needs the time reversal formula for stochastic processes defined in Hilbert space [33].
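By analogy with the finite-dimensional Reverse SDE above, the time-reversed process takes roughly the following form (a sketch only; the rigorous statement, including the role of the noise covariance in defining the Score, is given in [33]):

$$d\bar{X}_s = \Big[ -\Delta \bar{X}_s + g(T - s)^2\, \mathcal{S}_{T-s}(\bar{X}_s) \Big]\, ds + g(T - s)\, d\bar{W}_s, \qquad \bar{X}_s = X_{T-s}.$$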
Here, $\mathcal{S}_t$ is an operator defined on the function space in which the data live, called the Score operator [33,34]. The Score operator generalizes the finite-dimensional Score function to Hilbert space. The time reversal of the stochastic heat equation is therefore guaranteed by the combination of an operator that performs de-blurring and the Score operator that performs de-noising, and the learning method of the generative model presented in IHDM is likewise derived so as to learn this combination of operators.
Since IHDM is a discrete-time model, it is difficult to view it as a generalization of the SGM proposed by Song et al. [23]. Designing a generative model based on the time reversal theory of continuous-time stochastic equations defined in Hilbert space requires a rigorous treatment of the Score operator. The Denoising Diffusion Operator (DDO) [34] proposed a generative model based on the Langevin equation defined in Hilbert space, building on the theory of invariant measures, while the Hilbert Diffusion Model (HDM) [33] proposed a generative model based on SPDEs and derived a time reversal formula that allows noise scheduling in the continuous-time model for higher-quality data sampling.
 
Figure 14: Images generated by a Score-based generative model trained using SPDE [33]
 
A Score-based generative model trained using SPDEs has several advantages. It can sample from prior distributions created not only by Gaussian noise or the heat equation but also by blurring or pixelation. As can be seen in Figure 15, it can perform image restoration tasks such as inpainting or super-resolution without additional training. Moreover, because the generative model is designed based on time reversal theory in function space, resolution-free generation becomes possible when the neural network is replaced with a neural operator [35].
 
Figure 15: The generative model trained with SPDE can sample from various prior distributions without additional training [33]

💡 Conclusion

In this article, we examined the learning principles of the Diffusion Model and of Score-based generative models, which are attracting attention in generative AI research, and how stochastic partial differential equations can be used in generative-model research. While the main text focused on the heat equation used in IHDM [30], generative models based on the time reversal theory in Hilbert space can be proposed for a much wider range of degradation operators, as demonstrated by Cold Diffusion [32] and the Hilbert-space approaches [33,34]. They can also be used in motion generation, where sampling of smooth curves or surfaces is required.
Predicting which generative model will become mainstream in the rapidly evolving field of AI research is very difficult. Just six years ago, researchers in generative modeling were focused mainly on GANs and VAEs, and other techniques that could replace Diffusion Models may yet emerge. Moreover, improving the sampling quality of generative models requires both theoretical approaches and a variety of engineering techniques [4,7]. Nevertheless, from the perspective of studying the principles of machine learning with probabilistic methods, I believe this is a research field with many unanswered questions. For students interested in probability theory or statistics, I highly recommend it as a field worth taking on, and with this I conclude the article.

📄 Reference

  • [1] Generative Adversarial Nets, Goodfellow et al., NeurIPS (2014)
  • [2] Auto-Encoding Variational Bayes, Kingma et al., ICLR (2013)
  • [3] Variational Inference with Normalizing Flows, Rezende et al., ICML (2015)
  • [4] High-Resolution Image Synthesis With Latent Diffusion Models, Rombach et al., CVPR (2022)
  • [5] Zero-Shot Text-to-Image Generation, Ramesh et al., ICML (2021)
  • [6] Hierarchical Text-Conditional Image Generation with CLIP Latents, Ramesh et al., arXiv (2022)
  • [7] Diffusion Models Beat GANs on Image Synthesis, Dhariwal & Nichol, NeurIPS (2021)
  • [8] Neural Discrete Representation Learning, Oord et al., NeurIPS (2017)
  • [9] Generating Diverse High-Fidelity Images with VQ-VAE-2, Razavi et al., NeurIPS (2019)
  • [10] Large Scale GAN Training for High Fidelity Natural Image Synthesis, Brock et al., ICLR (2019)
  • [11] Extracting and composing robust features with denoising autoencoders, Vincent et al., ICML (2008)
  • [12] Generalized Denoising Auto-Encoders as Generative Models, Bengio et al., NeurIPS (2013)
  • [13] A fast learning algorithm for deep belief nets, Hinton et al., Neural Computation (2006)
  • [14] Estimation of Non-Normalized Statistical Models by Score Matching, A. Hyvärinen, JMLR (2005)
  • [15] A Connection Between Score Matching and Denoising Autoencoders, P. Vincent, Neural Computation (2011)
  • [16] A bound for the error in the normal approximation to the distribution of a sum of dependent random variables, C. Stein, Proc. Sixth Berkeley Symp. Math. Stat. Prob. (1972)
  • [17] A Kernelized Stein Discrepancy for Goodness-of-fit Tests, Liu et al., ICML (2016)
  • [18] Denoising Diffusion Probabilistic Models, Ho et al., NeurIPS (2020)
  • [19] Deep Unsupervised Learning using Nonequilibrium Thermodynamics, Sohl-Dickstein et al., ICML (2015)
  • [20] On the Convergence of Langevin Monte Carlo: The Interplay between Tail Growth and Smoothness, Erdogdu & Hosseinzadeh, COLT (2021)
  • [21] Diffusion Schrödinger Bridge with Applications to Score-Based Generative Modeling, De Bortoli et al., NeurIPS (2021)
  • [22] Convergence of denoising diffusion models under the manifold hypothesis, V. De Bortoli, TMLR (2022)
  • [23] Score-Based Generative Modeling through Stochastic Differential Equations, Song et al., ICLR (2021)
  • [24] First Hitting Diffusion Models for Generating Manifold, Graph and Categorical Data, Ye et al., NeurIPS (2022)
  • [25] Riemannian Score-Based Generative Modelling, De Bortoli et al., NeurIPS (2022)
  • [26] From Denoising Diffusions to Denoising Markov Models, Benton et al., arXiv (2022)
  • [27] Reverse-time diffusion equation models, B. D. O. Anderson, Stochastic Processes and their Applications (1982)
  • [28] Time reversal of diffusions, Haussmann & Pardoux, The Annals of Probability (1986)
  • [29] Score-based Generative Models with Lévy Processes, Yoon et al., NeurIPS SBM Workshop (2022)
  • [30] Generative Modelling with Inverse Heat Dissipation, Rissanen et al., ICLR (2023)
  • [31] Stochastic Equations in Infinite Dimensions, Da Prato & Zabczyk, Cambridge University Press (2014)
  • [32] Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise, Bansal et al., arXiv (2022)
  • [33] Score-based Generative Modeling through Stochastic Evolution Equations in Hilbert Spaces, Lim et al., arXiv (2023)
  • [34] Score-based Diffusion Models in Function Space, Lim et al., arXiv (2023)
  • [35] Neural Operator: Learning Maps Between Function Spaces, Kovachki et al., JMLR (2023)
  • [36] Stochastic partial differential equations and artificial intelligence, Horizon (2023)