Understanding the 'Understanding Diffusion Models: A Unified Perspective' paper
Understanding Diffusion Models: A Unified Perspective
This is just me reading this paper and making some notes on the go. Let me know if you want to discuss the paper or find something interesting in my notes. Pardon my casual, poor language skills throughout 😄
I will try to go through some concepts in detail, while briefly touching on other relevant concepts. The text is purely my interpretation of the published work by Calvin Luo, so please don't use these notes to understand the paper. Rather, think of this text as an exercise to check whether you understood the concepts better than me or not 😉
- As I mentioned, these are just some notes. I'd suggest reading the paper first.
- I won't go into detail on many concepts, but you surely can.
- Please let me know if you find any errors in my understanding of the concepts. Always feels great to be wrong 😭
Introduction
The goal of a generative model is to learn the true data distribution \(p(x)\) given some observed samples \(x\). The learned model can:
- Generate novel samples from the approximate data distribution
- Evaluate the likelihood of observed or sampled data
There are various directions for generative modeling, as mentioned in the paper. Generative Adversarial Networks (GANs) model the sampling procedure of a complex distribution, which is learned in an adversarial manner.
Adversarial means two sides that oppose each other. A GAN has a generator network and a discriminator network, trained against each other: the generator tries to produce new data, while the discriminator attempts to predict whether its input is fake (generated) or real.
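To make that two-player setup concrete, here is a minimal sketch of an adversarial training loop on toy 1D data - my own made-up example (assuming PyTorch), not something from the paper:

```python
import torch
import torch.nn as nn

# Toy setup: "real" data are 1D samples from N(4, 1).
# The generator maps noise to fake samples; the discriminator scores real vs. fake.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    real = 4 + torch.randn(64, 1)   # samples from the true distribution
    fake = G(torch.randn(64, 8))    # generator's attempt

    # Discriminator step: push real toward label 1, fake toward label 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator step: try to fool the discriminator into outputting 1 on fakes.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()

# After training, generated samples should drift toward the real mean (~4).
print(G(torch.randn(1000, 8)).mean().item())
```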
Likelihood-based generative models are another class, where the model learns to assign high likelihood to the observed data samples. These include autoregressive models, normalizing flows, and Variational Autoencoders (VAEs).
An autoregressive (AR) model is a type of statistical model used for analyzing and predicting sequential data, such as time series. It learns the probabilistic dependence between elements in a sequence and uses that knowledge to guess the next element. For example, during training, an autoregressive model might process many English sentences and learn that the word "is" often follows the word "it"; it can then generate new sequences containing "it is".
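In probabilistic terms - this is the standard chain-rule factorization, not notation taken from the paper - an AR model writes the joint likelihood of a sequence as a product of next-element conditionals:

\[p(x_1, \dots, x_T) = p(x_1) \prod_{i=2}^{T} p(x_i \mid x_1, \dots, x_{i-1})\]

Each factor predicts the next element given everything before it, which is exactly the "guess the next word" behavior described above.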
Normalizing flows are like starting with a simple shape, such as a ball of clay, and transforming it into a complex shape by squeezing and stretching it multiple times. Each transformation is reversible, so we can always go back to the ball shape. This process helps us model and understand complex patterns in data by starting from something simple and making it more intricate step-by-step.
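Formally - this is the standard change-of-variables result, not the paper's notation - if \(z \sim p_z\) is the simple "ball of clay" and \(x = f(z)\) for an invertible \(f\), the density of the transformed variable is

\[p_x(x) = p_z\!\left(f^{-1}(x)\right) \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|\]

Invertibility is what lets us "go back to the ball", and the Jacobian determinant tracks how much each squeeze or stretch changes volume, keeping the density normalized.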
There are other types of generative modeling, like energy-based modeling, where we use an energy function to learn the distribution. Related to this, score-based modeling learns the score of an energy-based model as a neural network. In this paper, the author describes both the likelihood-based and score-based interpretations - and lots of maths BTS!! 🤓
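For reference, these are the standard definitions (the paper develops the score-based view in detail later): an energy-based model and its score function are

\[p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}, \qquad \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)\]

Note that the intractable normalizing constant \(Z_\theta\) disappears under the gradient of the log, which is a big part of why working with scores is attractive.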
Background
Latent Variables v/s Plato's Allegory of the Cave
Now this is a very interesting and intuitive interpretation of latent variables. For many modalities, the latent variable \(z\) is an unseen random variable that is inferred from the observed data and used to represent or generate the given data.
But what is the Palto’s allegory of cave and how is it related to latent variable?
In the Allegory, a group of people are chained inside a cave their whole lives and can only see the 2D shadows in front of them, which are cast by 3D objects (most probably some creatures or animals) passed before a fire. Everything they observe is a projection of actual objects which they never saw and will never see. Such people describe what they observe through their own abstract concepts - how far a shadow stretches, how fast it moves, and so on - concepts quite different from ours, for sure. Even though the cave people can never see (or fully comprehend) the hidden objects, they can still reason and infer about them using those concepts. This reminds me of a very interesting reddit thread which could be total BS, but I enjoyed reading it: thread link here.
Now, analogously, the objects we see in everyday life can also be generated as a function of some high-level representations - e.g., abstract properties like color, size, shape, etc. It's just that our high-level representations are grounded in a 3D interpretation, while the cave people have the equivalent for the 2D shadows they see.
The allegory illustrates the idea of latent variables as potentially unobservable representations that determine observations - sounds cool. But (of course there is a but here) in the allegory the hidden objects are of *higher* dimension than the observations, whereas in generative modeling we usually seek a latent representation of *lower* dimension than the data. Think of it as a form of compression, while at the same time trying to uncover semantically meaningful structure describing the observations. Also, if we were to learn a higher-dimensional representation, we would need very strong priors and things would become more complicated.
So, that was the end of a cool analogy described in the paper. It's always nice to read a mix of philosophy and science. Now it's time for some maths 😈
Evidence Lower Bound
We start with the latent variable \(z\) and the data we observe, \(x\), modeled jointly by the distribution \(p(x, z)\).
In likelihood-based generative modeling, we want to maximize the likelihood \(p(x)\) of all observed \(x\). There are two ways we could manipulate the joint distribution to recover the likelihood of purely observed data, \(p(x)\).
The first one is: we explicitly marginalize out the latent variable \(z\). The marginal distribution is obtained by integrating \(p(x, z)\) with respect to \(z\).
\[p(x) = \int p(x, z) \, dz\]

Here, our latent variable \(z\) is continuous, and we integrate over all its possible values (because \(x\) can occur with any value of \(z\)).
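As a sanity check, here is a toy numerical sketch of this marginalization - a made-up Gaussian latent variable model (assuming NumPy/SciPy), not anything from the paper:

```python
import numpy as np
from scipy.stats import norm

# Made-up model for illustration:
#   z ~ N(0, 1)   and   x | z ~ N(z, 0.5^2),
# so p(x, z) = p(x | z) p(z), and p(x) = ∫ p(x, z) dz.
z_grid = np.linspace(-10, 10, 10001)  # fine grid over the latent space
dz = z_grid[1] - z_grid[0]

def p_x(x):
    joint = norm.pdf(x, loc=z_grid, scale=0.5) * norm.pdf(z_grid)  # p(x, z) on the grid
    return np.sum(joint) * dz  # Riemann-sum approximation of the integral

# Marginalizing this Gaussian model has a closed form: x ~ N(0, 1 + 0.5^2).
print(p_x(0.7))                            # numerical marginal
print(norm.pdf(0.7, scale=np.sqrt(1.25)))  # analytic answer, should match
```

This grid sum is fine in one dimension, but for a high-dimensional \(z\), integrating over all possible values quickly becomes intractable - which is exactly why the paper turns to the ELBO next.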