Nanshan Jia, Tingyu Zhu, Haoyu Liu, Zeyu Zheng

University of California, Berkeley

{nsjia,tingyu_zhu,haoyuliu,zyzheng}@berkeley.edu

Corresponding author.

###### Abstract

We propose a class of structured diffusion models, in which the prior distribution is chosen as a mixture of Gaussians rather than a standard Gaussian distribution. The specific Gaussian mixture prior can be chosen to incorporate structured information about the data. We develop a simple-to-implement training procedure that smoothly accommodates the use of a Gaussian mixture prior. Theory is provided to quantify the benefits of our proposed models compared to classical diffusion models. Numerical experiments with synthetic, image, and operational data are conducted to show the comparative advantages of our model. Our method is shown to be robust to mis-specification and particularly suits situations where training resources are limited or faster training in real time is desired.

## 1 Introduction

Diffusion models, since Ho et al. (2020); Song et al. (2020b), have emerged as a powerful class of generative models for training on and generating various forms of content, such as images and audio. Alongside this success, like many other models, training a diffusion model can require significant computational resources. Compared to more classical generative models such as generative adversarial networks (GANs), the inherent structure of diffusion models requires multiple steps to gradually corrupt structured data into noise and then reverse this process. This necessitates a large number of training steps to denoise successfully, further adding to the computational cost associated with network and data size.

That said, not all scenarios where diffusion models are used enjoy access to extensive training resources. For example, small or non-profit enterprises with limited compute budgets may wish to train diffusion models on their private data. In such resource-limited cases, training a standard diffusion model may exceed the budget, and an adequate number of training steps cannot be afforded. In addition, there are scenarios where one wishes to train in real time on streaming data and to reach a certain training performance level as fast as possible given a fixed amount of resources. In such cases, it is preferable to improve classical diffusion models to achieve faster training.

If training resources are limited, insufficient training can hinder the performance of diffusion models and result in poorly generated samples. Figure 1 below gives an illustrative example on gray-scale digit images, showing the performance of denoising diffusion probabilistic models under different numbers of training steps. When the model is trained for only 800 steps, it has limited exposure to the data; as a result, the generated samples tend to be blurry, incomplete, or inconsistent in digit shapes and structures. The model at this stage has not yet learned to reverse the noise process effectively. Our work is motivated by the goal of improving training efficiency so that, when resources are limited, a satisfactory level of performance can be achieved even with fewer training steps.

In this work, we provide an approach based on adjusting the prior distribution to improve the performance of classical diffusion models when training resources are limited. Classical diffusion models use a Gaussian prior distribution, a design motivated by the manifold hypothesis and by sampling in low-density regions Song & Ermon (2019). However, this approach ignores potential structured information within the data and considerably adds to the training complexity. Admittedly, when training resources are not a constraint, or when the data structure is difficult to interpret, a Gaussian prior can be a safe and sensible choice. That said, when users have structured domain knowledge about the data, say, clustered groups along some dimensions, it can be useful to integrate such information into the training of diffusion models. To incorporate such information, we propose using a Gaussian Mixture Model (GMM) as the prior distribution. We develop the associated training process and examine the comparative performance. The main results of our work are summarized as follows.

1) We propose a class of mixed diffusion models, whose prior distribution is a Gaussian mixture model instead of the Gaussian noise used in the previous diffusion literature. We detail the forward and reverse processes of the mixed diffusion models, including both the Mixed Denoising Diffusion Probabilistic Models (mixDDPM) and the Mixed Score-based Generative Models (mixSGM), with an auxiliary dispatcher that assigns each data point to its corresponding center.

2) We introduce a quantitative metric, the "Reverse Effort", which measures the distance between the prior distribution and the finite-sample data distribution under an appropriate coupling. With the Reverse Effort, we explain the benefits of the mixed diffusion models by quantifying their effort-reduction effect, which further substantiates the efficiency of the mixed models.

3) We conduct various numerical experiments on synthetic, operational, and image datasets. All numerical results support the efficiency of the mixed diffusion models, especially when training resources are limited.

### 1.1 Related Literature

Diffusion models and analysis. Diffusion models work by modeling the process of data degradation and subsequently reversing this process to generate new samples from noise. The success of diffusion models lies in their ability to generate high-quality, diverse outputs. Their applications have expanded across fields such as image and audio synthesis Kong et al. (2020); Dhariwal & Nichol (2021); Leng et al. (2022); Rombach et al. (2022); Yu et al. (2024), image editing Meng et al. (2021); Avrahami et al. (2022); Kawar et al. (2023); Mokady et al. (2023), text-to-image generation Saharia et al. (2022); Zhang et al. (2023); Kawar et al. (2023), and other downstream tasks including medical image generation Khader et al. (2023); Kazerouni et al. (2023) and molecular dynamics modeling Wu & Li (2023); Arts et al. (2023), making them a pivotal innovation in the landscape of generative AI. Tang & Zhao (2024) provide further understanding of score-based diffusion models via stochastic differential equations.

Other methods for efficiency improvement. A range of work has improved the performance of diffusion models by proposing more efficient noise schedules Kingma et al. (2021); Karras et al. (2022); Hang & Gu (2024), introducing latent structures Rombach et al. (2022); Kim et al. (2023); Podell et al. (2024); Pernias et al. (2024), improving training efficiency Wang et al. (2023); Haxholli & Lorenzi (2023), and applying faster samplers Song et al. (2020a); Lu et al. (2022a, b); Watson et al. (2022); Zhang & Chen (2023); Zheng et al. (2023); Pandey et al. (2024); Xue et al. (2024); Zhao et al. (2024). In addition, Yang et al. (2024) employ a spectrum of neural networks whose sizes are adapted according to the importance of each generative step.

Use of non-Gaussian prior distributions. There is a series of related but distinct work on using non-Gaussian noise distributions to enhance the performance and efficiency of diffusion models; see Nachmani et al. (2021); Yen et al. (2023); Bansal et al. (2024), among others. Our work instead emphasizes the use of a structured prior distribution (rather than a noise distribution), with the purpose of incorporating data information into the model.

The remainder of this paper is organized as follows. Section 2 reviews the background on diffusion models, including both the Denoising Diffusion Probabilistic Model Ho et al. (2020) and Score-based Generative Models Song et al. (2020b). Section 3 uses numerical experiments on 1D synthetic datasets to illustrate the motivation of this work. Section 4 details our new models and provides theoretical analysis. Section 5 presents numerical experiments, and Section 6 provides extensions based on variance estimation. Finally, Section 7 concludes the paper with future directions.

## 2 Brief Review on Diffusion Models and Notation

In this section, we briefly review the two classical classes of diffusion models: the Denoising Diffusion Probabilistic Model (DDPM) Ho et al. (2020) in Section 2.1, and Score-based Generative Models (SGM) Song et al. (2020b) in Section 2.2. Meanwhile, we specify the notation related to diffusion models in preparation for the description of our proposed methods in Section 4.

### 2.1 Denoising Diffusion Probabilistic Model

In DDPM, the forward process is modeled as a discrete-time Markov chain with Gaussian transition kernels. This Markov chain starts with the observed data ${\mathbf{x}}_{0}$, which follows the data distribution $p_{\mathrm{data}}$. The forward process gradually adds noise to ${\mathbf{x}}_{0}$ and forms a finite-time Markov process $\{{\mathbf{x}}_{0},{\mathbf{x}}_{1},\cdots,{\mathbf{x}}_{T}\}$. The transition density of this Markov chain can be written as

${\mathbf{x}}_{t}|{\mathbf{x}}_{t-1}\sim\mathcal{N}\left(\sqrt{1-\beta_{t}}\,{\mathbf{x}}_{t-1},\beta_{t}\bm{I}\right),\quad t=1,2,\cdots,T,$ | (1) |

where $\beta_{1},\cdots,\beta_{T}$ is called the noise schedule. The marginal density of ${\mathbf{x}}_{t}$ conditional on ${\mathbf{x}}_{0}$ then has the closed form ${\mathbf{x}}_{t}|{\mathbf{x}}_{0}\sim\mathcal{N}\left(\sqrt{\overline{\alpha}_{t}}{\mathbf{x}}_{0},(1-\overline{\alpha}_{t})\bm{I}\right)$, where $\overline{\alpha}_{t}=\prod_{s=1}^{t}(1-\beta_{s})$ for $t=1,2,\cdots,T$. The noise schedule is chosen so that $\overline{\alpha}_{T}$ is close to $0$.
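Concretely, the closed-form marginal means a forward sample at any step $t$ can be drawn in one shot rather than by iterating (1). The sketch below illustrates this in NumPy; the linear schedule is an assumed toy choice, not one prescribed by the paper.

```python
import numpy as np

def forward_sample(x0, t, betas, rng):
    """Draw x_t | x_0 directly from the closed-form marginal (t is 1-indexed)."""
    alpha_bar = np.cumprod(1.0 - betas)     # alpha_bar_t = prod_{s<=t} (1 - beta_s)
    a = alpha_bar[t - 1]
    eps = rng.standard_normal(x0.shape)     # epsilon ~ N(0, I)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)       # assumed linear noise schedule
x0 = rng.standard_normal(2)
xT, _ = forward_sample(x0, 1000, betas, rng)   # near N(0, I), since alpha_bar_T ~ 0
```

Note that `np.cumprod(1.0 - betas)[-1]` is essentially zero here, matching the requirement that $\overline{\alpha}_{T}$ be close to $0$.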

During the training process, a neural network ${\mathbf{\epsilon}}_{{\mathbf{\theta}}}:\mathbb{R}^{d}\times\{1,2,\cdots,T\}\to\mathbb{R}^{d}$ parameterized by ${\mathbf{\theta}}$ is trained to predict the random Gaussian noise ${\mathbf{\epsilon}}$ given the time $t$ and the value of the forward process ${\mathbf{x}}_{t}$. The DDPM training objective is proposed as

$\mathcal{L}^{\mathrm{DDPM}}:=\sum_{t=1}^{T}\mathbb{E}_{{\mathbf{x}}_{0},{\mathbf{\epsilon}}}\left[\omega_{t}\cdot\left\|{\mathbf{\epsilon}}-{\mathbf{\epsilon}}_{{\mathbf{\theta}}}\left({\mathbf{x}}_{t},t\right)\right\|_{2}^{2}\right]\text{ with }{\mathbf{x}}_{t}=\sqrt{\overline{\alpha}_{t}}{\mathbf{x}}_{0}+\sqrt{1-\overline{\alpha}_{t}}\,{\mathbf{\epsilon}}$ | (2) |

where $\omega_{1},\omega_{2},\cdots,\omega_{T}$ is a sequence of weights and $\|\cdot\|_{2}$ is the $l_{2}$ metric.

Based on the trained neural network, the reverse sampling process is also modeled as a discrete-time process with Gaussian transition kernels. Here and throughout what follows, we use $\tilde{{\mathbf{x}}}$ to denote the reverse process. The reverse process starts from the prior distribution $\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}(\bm{0},\bm{I})$ and the transition density is given by

$\tilde{{\mathbf{x}}}_{t-1}|\tilde{{\mathbf{x}}}_{t}\sim\mathcal{N}(\bm{\mu}_{{\mathbf{\theta}}}(\tilde{{\mathbf{x}}}_{t},t),\beta_{t}\bm{I})\text{ with }\bm{\mu}_{{\mathbf{\theta}}}({\mathbf{x}},t)=\frac{1}{\sqrt{1-\beta_{t}}}\left({\mathbf{x}}-\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}{\mathbf{\epsilon}}_{{\mathbf{\theta}}}({\mathbf{x}},t)\right)$ | (3) |

for $t=1,2,\cdots,T$. The final result $\tilde{{\mathbf{x}}}_{0}$ is considered to be the output of the DDPM and its distribution is used to approximate the data distribution $p_{\mathrm{data}}$.
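As a minimal sketch, the reverse transitions (3) amount to an ancestral-sampling loop. Here `eps_model` is a stand-in for any trained noise predictor (not the paper's particular network), and the schedule is an assumed toy choice.

```python
import numpy as np

def ddpm_sample(eps_model, betas, dim, rng):
    """Ancestral sampling through the reverse transitions (3)."""
    T = len(betas)
    alpha_bar = np.cumprod(1.0 - betas)
    x = rng.standard_normal(dim)            # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        beta = betas[t - 1]
        # posterior mean mu_theta from equation (3)
        mu = (x - beta / np.sqrt(1.0 - alpha_bar[t - 1]) * eps_model(x, t)) \
             / np.sqrt(1.0 - beta)
        noise = rng.standard_normal(dim) if t > 1 else 0.0   # no noise at t=1
        x = mu + np.sqrt(beta) * noise
    return x

# Placeholder noise predictor; a trained network would go here.
out = ddpm_sample(lambda x, t: np.zeros_like(x),
                  np.linspace(1e-4, 0.02, 50), dim=3,
                  rng=np.random.default_rng(1))
```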

### 2.2 Score-based Generative Models

Both the forward and the reverse processes of the SGM are modeled by Stochastic Differential Equations (SDEs). The forward SDE starts from ${\mathbf{x}}_{0}\sim p_{\mathrm{data}}$ and evolves according to

$d{\mathbf{x}}_{t}=f_{t}{\mathbf{x}}_{t}dt+g_{t}d{\mathbf{w}}_{t},$ | (4) |

where $f_{t}$ is the drift coefficient, $g_{t}$ is the diffusion coefficient, and ${\mathbf{w}}_{t}$ is a $d$-dimensional standard Brownian motion. In particular, the forward SDE is an Ornstein–Uhlenbeck (OU) process and the marginal distribution has a closed-form Gaussian representation: ${\mathbf{x}}_{t}|{\mathbf{x}}_{0}\sim\mathcal{N}(\alpha_{t}{\mathbf{x}}_{0},\sigma^{2}_{t}\bm{I})$, where $\alpha_{t}$ and $\sigma_{t}$ are determined solely by the coefficients $f_{t},g_{t}$.

According to Anderson (1982), the reverse of diffusion process (4) is also a diffusion process and can be represented as

$d\tilde{{\mathbf{x}}}_{t}=\left(f_{t}\tilde{{\mathbf{x}}}_{t}-g_{t}^{2}\nabla_{{\mathbf{x}}}\log p_{t}(\tilde{{\mathbf{x}}}_{t})\right)dt+g_{t}d\tilde{{\mathbf{w}}}_{t},\quad\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}(\bm{0},\sigma^{2}_{T}\bm{I}),$ | (5) |

where $dt$ is an infinitesimal negative time step and $\tilde{{\mathbf{w}}}_{t}$ is a standard Brownian motion when time flows backward from $T$ to $0$. The quantity $\nabla_{{\mathbf{x}}}\log p_{t}({\mathbf{x}})$ is called the score function and is approximated by a trained neural network $\bm{s}_{{\mathbf{\theta}}}$. Later work has substituted the score function with the noise model Lu et al. (2022a) and the data-prediction model Lu et al. (2022b) to improve the overall efficiency of the SGM. To stay aligned with the DDPM, we introduce only the noise model here.

Instead of learning the score function directly, the noise model uses a neural network ${\mathbf{\epsilon}}_{{\mathbf{\theta}}}({\mathbf{x}},t):\mathbb{R}^{d}\times(0,T]\to\mathbb{R}^{d}$ to learn the scaled score function $-\sigma_{t}\nabla_{{\mathbf{x}}}\log p_{t}({\mathbf{x}})$. Following Lu et al. (2022a), the training objective is chosen to be

$\mathcal{L}^{\mathrm{SGM}}:=\int_{0}^{T}\omega_{t}\cdot\mathbb{E}_{{\mathbf{x}}_{0},{\mathbf{\epsilon}}}\left[\|{\mathbf{\epsilon}}_{{\mathbf{\theta}}}({\mathbf{x}}_{t},t)-{\mathbf{\epsilon}}\|_{2}^{2}\right]dt\quad\text{with}\quad{\mathbf{x}}_{t}=\alpha_{t}{\mathbf{x}}_{0}+\sigma_{t}{\mathbf{\epsilon}},$ | (6) |

where $\omega_{t}$ is a weighting function. Given the trained noise model ${\mathbf{\epsilon}}_{{\mathbf{\theta}}}$, the reverse SDE (5) can be reformulated as

$d\tilde{{\mathbf{x}}}_{t}=\left(f_{t}\tilde{{\mathbf{x}}}_{t}+\frac{g_{t}^{2}}{\sigma_{t}}{\mathbf{\epsilon}}_{{\mathbf{\theta}}}(\tilde{{\mathbf{x}}}_{t},t)\right)dt+g_{t}d\tilde{{\mathbf{w}}}_{t},\quad\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}(\bm{0},\sigma^{2}_{T}\bm{I}).$ | (7) |

Various numerical SDE solvers can be applied on (7) to obtain the final output $\tilde{{\mathbf{x}}}_{0}$.
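The simplest such solver is the Euler-Maruyama scheme. The sketch below applies it to (7) under assumed toy coefficients ($f_{t}=0$, $g_{t}=1$, $\sigma_{t}=\sqrt{t}$, a variance-exploding choice) and a placeholder noise model; none of these choices come from the paper.

```python
import numpy as np

def reverse_sde_euler(eps_model, f, g, sigma, T, n_steps, dim, rng):
    """Euler-Maruyama discretization of the reverse SDE (7), run from t=T to t=0.

    eps_model(x, t) stands in for the trained noise network; f, g, sigma are the
    drift, diffusion, and marginal-std functions of the forward SDE.
    """
    dt = T / n_steps
    x = sigma(T) * rng.standard_normal(dim)          # x_T ~ N(0, sigma_T^2 I)
    for k in range(n_steps):
        t = T - k * dt                               # current (positive) time
        drift = f(t) * x + (g(t) ** 2 / sigma(t)) * eps_model(x, t)
        # time flows backward, so the step is -drift*dt plus fresh noise
        x = x - drift * dt + g(t) * np.sqrt(dt) * rng.standard_normal(dim)
    return x

# Toy run: f = 0, g = 1, sigma(t) = sqrt(t), placeholder noise model
out = reverse_sde_euler(lambda x, t: np.zeros_like(x),
                        lambda t: 0.0, lambda t: 1.0, np.sqrt,
                        T=1.0, n_steps=100, dim=2,
                        rng=np.random.default_rng(0))
```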

## 3 Illustration with One-dimensional Examples

In this section, we provide a brief numerical illustration of the performance comparison between DDPM and mixDDPM (the method that will be formally introduced in the next section). We illustrate through two 1-dimensional experiments. The true data distribution for the first experiment is a standardized Gaussian mixture distribution with symmetric clusters, $p_{\mathrm{data}}=\frac{1}{2}(\mathcal{N}(-0.9,0.19)+\mathcal{N}(0.9,0.19))$. The second experiment uses a Gamma mixture distribution whose clusters share the same mean and variance as those of the Gaussian mixture above.
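The two data distributions can be sampled as follows. The Gamma-mixture construction here is one natural moment-matching choice (shape $k=\mathrm{mean}^{2}/\mathrm{var}$, scale $\theta=\mathrm{var}/\mathrm{mean}$, with the negative cluster mirrored); the paper does not spell out its exact parameterization, so this is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gaussian_mixture(n):
    """p_data = 0.5 * N(-0.9, 0.19) + 0.5 * N(0.9, 0.19); 0.19 is the variance."""
    signs = rng.choice([-1.0, 1.0], size=n)
    return signs * 0.9 + np.sqrt(0.19) * rng.standard_normal(n)

def sample_gamma_mixture(n):
    """Gamma mixture matching the per-cluster mean (+-0.9) and variance (0.19):
    shape k = mean^2 / var, scale theta = var / mean, negative cluster mirrored."""
    k, theta = 0.9 ** 2 / 0.19, 0.19 / 0.9
    signs = rng.choice([-1.0, 1.0], size=n)
    return signs * rng.gamma(k, theta, size=n)
```

Both mixtures then have overall mean $0$ and variance $0.9^{2}+0.19=1.0$, consistent with the "standardized" description above.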

Results for the Gaussian mixture experiment:

| | DDPM | mixDDPM |
|---|---|---|
| $\mathcal{W}_{1}$ distance | 0.222 | 0.113 |
| K-S statistic | 0.213 | 0.073 |

Results for the Gamma mixture experiment:

| | DDPM | mixDDPM |
|---|---|---|
| $\mathcal{W}_{1}$ distance | 0.206 | 0.136 |
| K-S statistic | 0.232 | 0.103 |

We present the results in the two tables above, for the Gaussian mixture and the Gamma mixture experiments respectively. The experimental setting is as follows. The training dataset includes 256 samples from the data distribution. The classical DDPM is trained for 16k steps. We also run the mixDDPM, our new model introduced in Section 4, on the same training dataset with the same neural network architecture, the same number of network parameters, the same number of training steps, and the same random seeds. We calculate the $\mathcal{W}_{1}$ distance and the Kolmogorov–Smirnov (K-S) statistic Massey Jr (1951); Fasano & Franceschini (1987); Berger & Zhou (2014) between the finite-sample distributions of the generated samples and the true data distribution. In addition, Figure 2 plots the densities of both the finite-sample distributions and the data distribution for the Gaussian mixture experiment.

These two 1-dimensional examples show that when the number of training steps is inadequate, the mixDDPM model, with its non-Gaussian prior, can achieve significantly better performance than the classical DDPM with all else held equal.

## 4 Mixed Diffusion Models

We propose in this section the mixed diffusion models. Instead of a Gaussian prior distribution, our model uses a Gaussian mixture. While still keeping the benefits of Gaussian priors, i.e., compatibility with the manifold hypothesis and sampling in low-density areas, the additional parameters enable the model to incorporate more information about the data distribution and thereby reduce the overall load of the reverse process. In what follows, we first introduce how to incorporate data information into the prior distribution through an additional dispatcher. We detail the mixDDPM and the mixSGM in Subsections 4.1 and 4.2, respectively.

In general, the prior distributions in the mixed diffusion models belong to a class of Gaussian mixture distributions with centers ${\mathbf{c}}_{1},\cdots,{\mathbf{c}}_{K}$. The number of centers $K$, as well as the specific values of the centers ${\mathbf{c}}_{1},\cdots,{\mathbf{c}}_{K}$, are fixed before training and sampling. These parameters can be chosen flexibly by users of the model, either through domain knowledge or through various analysis methods. There is no need to worry about whether the choices are optimal; the choices only need to capture some partial structural information about the data known to users. For instance, users may employ clustering techniques in particular low-dimensional subspaces of the data, or use labels on some dimensions of the data. Additional discussion of possible center-selection methods is provided in the Appendix.

Now, for each data sample ${\mathbf{x}}_{0}\sim p_{\mathrm{data}}$, a dispatcher $D:\mathbb{R}^{d}\to\{1,2,\cdots,K\}$ assigns ${\mathbf{x}}_{0}$ to one of the centers. In this work, the dispatcher is defined as follows:

$D({\mathbf{x}})=\underset{i}{\arg\min}\{d({\mathbf{x}},{\mathbf{c}}_{i})\},$ | (8) |

where $d(\cdot,\cdot)$ is a distance metric; for example, the $l_{2}$ distance. In other words, the dispatcher assigns the data ${\mathbf{x}}_{0}$ to its nearest center.
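With the $l_{2}$ metric, the dispatcher (8) reduces to a nearest-center lookup, e.g. (note the 0-indexed labels here, for convenience, whereas the paper indexes centers from 1):

```python
import numpy as np

def dispatch(x, centers):
    """Dispatcher (8) under the l2 metric: index of the nearest center."""
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared distances to each center
    return int(np.argmin(d2))                 # 0-indexed center label

centers = np.array([[-2.0, 0.0], [2.0, 0.0]])
assert dispatch(np.array([1.5, 0.3]), centers) == 1   # nearer to the second center
```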

### 4.1 The mixDDPM

We first introduce the forward process. Given a data sample ${\mathbf{x}}_{0}$, suppose the dispatcher assigns it to the $j$-th center, i.e., $D({\mathbf{x}}_{0})=j$. Similar to the DDPM, the forward process is a discrete-time Markov chain. Conditional on ${\mathbf{x}}_{t-1}$ and $D({\mathbf{x}}_{0})=j$, the distribution of ${\mathbf{x}}_{t}$ is given by

${\mathbf{x}}_{t}|{\mathbf{x}}_{t-1}\sim\mathcal{N}\left(\sqrt{1-\beta_{t}}({\mathbf{x}}_{t-1}-{\mathbf{c}}_{j})+{\mathbf{c}}_{j},\beta_{t}\bm{I}\right),\quad t=1,2,\cdots,T.$ | (9) |

In other words, the process ${\mathbf{x}}_{t}-{\mathbf{c}}_{j}$ follows the same transition density as (1). As a result, the marginal distribution of ${\mathbf{x}}_{t}$, conditional on ${\mathbf{x}}_{0}$ being assigned to the $j$-th center, is

${\mathbf{x}}_{t}|{\mathbf{x}}_{0}\sim\mathcal{N}\left(\sqrt{\overline{\alpha}_{t}}({\mathbf{x}}_{0}-{\mathbf{c}}_{j})+{\mathbf{c}}_{j},(1-\overline{\alpha}_{t})\bm{I}\right),\quad t=1,2,\cdots,T.$ | (10) |

As $t$ increases, the distribution of ${\mathbf{x}}_{t}$ gradually converges to $\mathcal{N}({\mathbf{c}}_{j},\bm{I})$. That is, the prior distribution conditional on $D({\mathbf{x}}_{0})=j$ is a Gaussian distribution centered at ${\mathbf{c}}_{j}$ with identity covariance. Hence, the unconditional prior distribution is the Gaussian mixture $\sum_{i=1}^{K}p_{i}\mathcal{N}({\mathbf{c}}_{i},\bm{I})$, where $p_{i}$ is the proportion of data assigned to the $i$-th center.

To learn the noise given the forward process, the mixDDPM uses a neural network ${\mathbf{\epsilon}}_{{\mathbf{\theta}}}:\mathbb{R}^{d}\times\{1,2,\cdots,T\}\times\{1,2,\cdots,K\}\to\mathbb{R}^{d}$. The neural network takes three inputs: the state of the forward process ${\mathbf{x}}_{t}\in\mathbb{R}^{d}$, the time $t\in\{1,2,\cdots,T\}$, and the center index $j\in\{1,2,\cdots,K\}$. Our method adopts the U-Net architecture, as suggested by Ho et al. (2020); Rombach et al. (2022); Ramesh et al. (2022). Similar to (2), the mixDDPM adopts the following training objective:

$\mathcal{L}_{\mathrm{mix}}^{\mathrm{DDPM}}:=\sum_{t=1}^{T}\mathbb{E}_{{\mathbf{x}}_{0},{\mathbf{\epsilon}}}\left[\omega_{t}\cdot\left\|{\mathbf{\epsilon}}-{\mathbf{\epsilon}}_{{\mathbf{\theta}}}({\mathbf{x}}_{t},t,j)\right\|_{2}^{2}\right],\quad\text{with}\quad{\mathbf{x}}_{t}=\sqrt{\overline{\alpha}_{t}}({\mathbf{x}}_{0}-{\mathbf{c}}_{j})+{\mathbf{c}}_{j}+\sqrt{1-\overline{\alpha}_{t}}{\mathbf{\epsilon}}.$ | (11) |

The training process can be viewed as solving the optimization problem $\underset{{\mathbf{\theta}}}{\min}\mathcal{L}_{\mathrm{mix}}^{\mathrm{DDPM}}$ by the stochastic gradient descent method.

During the reverse sampling process, the mixDDPM first samples $\tilde{{\mathbf{x}}}_{T}\sim\sum_{i=1}^{K}p_{i}\mathcal{N}({\mathbf{c}}_{i},\bm{I})$. To do so, it first samples $j$ from $\{1,2,\cdots,K\}$ such that $\mathbb{P}(j=i)=p_{i}$. Then, it samples $\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{j},\bm{I})$. The transition density for $\tilde{{\mathbf{x}}}_{t-1}$, conditional on $\tilde{{\mathbf{x}}}_{t}$, is given by

$\tilde{{\mathbf{x}}}_{t-1}|\tilde{{\mathbf{x}}}_{t}\sim\mathcal{N}(\bm{\mu}_{{\mathbf{\theta}}}(\tilde{{\mathbf{x}}}_{t},t),\beta_{t}\bm{I}),\quad t=1,2,\cdots,T,$ | (12) |

where

$\displaystyle\bm{\mu}_{{\mathbf{\theta}}}({\mathbf{x}},t)=\frac{1}{\sqrt{1-\beta_{t}}}\left({\mathbf{x}}-{\mathbf{c}}_{j}-\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}{\mathbf{\epsilon}}_{{\mathbf{\theta}}}({\mathbf{x}},t,j)\right)+{\mathbf{c}}_{j}.$ | (13) |

We summarize the training and sampling processes for the mixDDPM in Algorithms 1 and 2 below.

Input: samples ${\mathbf{x}}_{0}$ from the data distribution, untrained neural network ${\mathbf{\epsilon}}_{{\mathbf{\theta}}}$, time horizon $T$, noise schedule $\beta_{1},\cdots,\beta_{T}$, number of centers $K$, and the centers ${\mathbf{c}}_{1},\cdots,{\mathbf{c}}_{K}$

Output: Trained neural network ${\mathbf{\epsilon}}_{{\mathbf{\theta}}}$

repeat

Get data ${\mathbf{x}}_{0}$

Find center $j=D({\mathbf{x}}_{0})$

Sample $t\sim U\{1,2,\cdots,T\}$ and ${\mathbf{\epsilon}}\sim\mathcal{N}(\bm{0},\bm{I})$

${\mathbf{x}}_{t}\leftarrow\sqrt{\overline{\alpha}_{t}}({\mathbf{x}}_{0}-{\mathbf{c}}_{j})+{\mathbf{c}}_{j}+\sqrt{1-\overline{\alpha}_{t}}{\mathbf{\epsilon}}$

$\mathcal{L}\leftarrow\omega_{t}\left\|{\mathbf{\epsilon}}-{\mathbf{\epsilon}}_{{\mathbf{\theta}}}({\mathbf{x}}_{t},t,j)\right\|_{2}^{2}$

Take a gradient descent step on $\nabla_{{\mathbf{\theta}}}\mathcal{L}$

until converged or the training resource/time limit is hit

Input: Trained neural network ${\mathbf{\epsilon}}_{{\mathbf{\theta}}}$, center weights $p_{1},\cdots,p_{K}$, centers ${\mathbf{c}}_{1},\cdots,{\mathbf{c}}_{K}$

Sample $j\in\{1,\cdots,K\}$ with $\mathbb{P}(j=i)=p_{i}$ for $i=1,\cdots,K$

Sample $\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{j},\bm{I})$

for $t=T$ to $1$ do

Calculate $\bm{\mu}_{{\mathbf{\theta}}}(\tilde{{\mathbf{x}}}_{t},t)=\frac{1}{\sqrt{1-\beta_{t}}}\left(\tilde{{\mathbf{x}}}_{t}-{\mathbf{c}}_{j}-\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}{\mathbf{\epsilon}}_{{\mathbf{\theta}}}(\tilde{{\mathbf{x}}}_{t},t,j)\right)+{\mathbf{c}}_{j}$

Sample $\tilde{{\mathbf{x}}}_{t-1}\sim\mathcal{N}(\bm{\mu}_{{\mathbf{\theta}}}(\tilde{{\mathbf{x}}}_{t},t),\beta_{t}\bm{I})$

end for

Return $\tilde{{\mathbf{x}}}_{0}$
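As a compact illustration of Algorithms 1 and 2, the sketch below prepares one training tuple and runs the sampling loop in NumPy. Here `eps_model` is a placeholder for the trained network, and the schedule and centers are toy values (assumptions for illustration, not choices from the paper); labels are 0-indexed for convenience.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 100, 2
betas = np.linspace(1e-4, 0.05, T)          # toy noise schedule (assumed)
alpha_bar = np.cumprod(1.0 - betas)
centers = np.array([[-3.0], [3.0]])         # toy centers c_1, c_2 (assumed)

def dispatch(x0):
    return int(np.argmin(np.sum((centers - x0) ** 2, axis=1)))

def train_pair(x0):
    """One pass of Algorithm 1: build (x_t, t, j, eps). A real implementation
    would feed these into eps_theta and take a gradient step on the loss (11)."""
    j = dispatch(x0)
    t = int(rng.integers(1, T + 1))
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    xt = np.sqrt(a) * (x0 - centers[j]) + centers[j] + np.sqrt(1.0 - a) * eps
    return xt, t, j, eps

def sample(eps_model, p, dim=1):
    """Algorithm 2 with a generic trained noise model eps_model(x, t, j)."""
    j = rng.choice(K, p=p)                   # pick a center with P(j=i) = p_i
    x = centers[j] + rng.standard_normal(dim)          # x_T ~ N(c_j, I)
    for t in range(T, 0, -1):
        beta, a = betas[t - 1], alpha_bar[t - 1]
        mu = (x - centers[j] - beta / np.sqrt(1.0 - a) * eps_model(x, t, j)) \
             / np.sqrt(1.0 - beta) + centers[j]
        x = mu + (np.sqrt(beta) * rng.standard_normal(dim) if t > 1 else 0.0)
    return x
```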

Before ending this section, we illustrate why the mixDDPM improves on the overall efficiency of the DDPM. We first define the reverse efforts for the DDPM and the mixDDPM as

$\mathrm{ReEff}^{\mathrm{DDPM}}:=\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}(\bm{0},\bm{I}),\;{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}}\left[\|{\mathbf{x}}_{0}-\tilde{{\mathbf{x}}}_{T}\|^{2}\right],$ | (14) |

$\mathrm{ReEff}^{\mathrm{DDPM}}_{\mathrm{mix}}:=\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{D({\mathbf{x}}_{0})},\bm{I})}\mathbb{E}_{{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}}\left[\|{\mathbf{x}}_{0}-\tilde{{\mathbf{x}}}_{T}\|^{2}\right],$ | (15) |

where $\bar{p}_{\mathrm{data}}$ is the empirical distribution over the given data. We now explain the definition of the reverse effort. The forward process gradually adds noise to the initial data ${\mathbf{x}}_{0}$, until it converges to the prior distribution. On the contrary, the reverse process aims to recover ${\mathbf{x}}_{0}$ given $\tilde{{\mathbf{x}}}_{T}$ as input. Hence, we evaluate the distance between ${\mathbf{x}}_{0}$ and $\tilde{{\mathbf{x}}}_{T}$ and define its expectation as the reverse effort. One noteworthy fact of the reverse effort for the mixDDPM is that ${\mathbf{x}}_{0}$ and $\tilde{{\mathbf{x}}}_{T}$ are not independent. This can be attributed to the dispatcher, which assigns ${\mathbf{x}}_{0}$ to the $D({\mathbf{x}}_{0})$-th center. We present the relationship between the two reverse efforts defined by (14) and (15) in Proposition 1.

###### Proposition 1.

Given the cluster number $K$ and the cluster centers ${\mathbf{c}}_{1},\cdots,{\mathbf{c}}_{K}$, we define $X_{i}=\{{\mathbf{x}}:D({\mathbf{x}})=i\}$ and $p_{i}=\frac{|X_{i}|}{\sum_{j=1}^{K}|X_{j}|}$ for $i=1,2,\cdots,K$. Under the assumption that ${\mathbf{c}}_{i}$ is the arithmetic mean of $X_{i}$, we have

$\mathrm{ReEff}^{\mathrm{DDPM}}_{\mathrm{mix}}=\mathrm{ReEff}^{\mathrm{DDPM}}-\sum_{i=1}^{K}p_{i}\|{\mathbf{c}}_{i}\|^{2}.$ | (16) |

Proposition 1 shows that the mixDDPM requires less reverse effort compared to the classical DDPM. In addition, this reduction can be quantified as a weighted average of the $l_{2}$-norm of the centers. This reduction can be understood in the following way. We have discussed in Section 3 that the prior distribution of the DDPM contains no information about the data distribution. In contrast, the prior distribution of the mixDDPM retains some data information through the choice of the centers ${\mathbf{c}}_{1},\cdots,{\mathbf{c}}_{K}$. This retained data information, together with the dispatcher, helps reduce the reverse effort by providing guidance on where to initiate the reverse process. Although this reduction may not significantly affect sampling quality when the neural network is well-trained, it can lead to potential improvements when training is insufficient.
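The identity (16) can be checked numerically. The sketch below builds a small two-cluster dataset (synthetic, for illustration only), takes each center to be the arithmetic mean of its dispatched set as Proposition 1 assumes, and evaluates both reverse efforts via the closed form $\mathbb{E}\|{\mathbf{x}}_{0}-\tilde{{\mathbf{x}}}_{T}\|^{2}=\mathbb{E}\|{\mathbf{x}}_{0}-{\mathbf{m}}\|^{2}+d$ for $\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{m}},\bm{I})$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
data = np.vstack([rng.normal(-4.0, 1.0, (50, d)),
                  rng.normal(4.0, 1.0, (50, d))])

# Dispatch to the nearest center, then set each center to the arithmetic mean
# of its dispatched set, as assumed in Proposition 1 (one k-means-style pass).
centers = np.array([data[:50].mean(0), data[50:].mean(0)])
labels = np.argmin(((data[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
centers = np.array([data[labels == i].mean(0) for i in range(2)])
labels = np.argmin(((data[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
p = np.bincount(labels, minlength=2) / len(data)

# E||x0 - x_T||^2 = E||x0 - m||^2 + d for x_T ~ N(m, I), so both reverse
# efforts have closed forms under the empirical data distribution.
reff_ddpm = (data ** 2).sum(1).mean() + d
reff_mix = ((data - centers[labels]) ** 2).sum(1).mean() + d
reduction = (p * (centers ** 2).sum(1)).sum()
assert np.isclose(reff_mix, reff_ddpm - reduction)   # identity (16) holds
```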

### 4.2 The mixSGM

Suppose a given data sample ${\mathbf{x}}_{0}$ is assigned to the $j$-th center by the dispatcher, i.e., $D({\mathbf{x}}_{0})=j$. The mixSGM modifies the forward SDE from (4) to

$d{\mathbf{x}}_{t}=f_{t}({\mathbf{x}}_{t}-{\mathbf{c}}_{j})dt+g_{t}d{\mathbf{w}}_{t}.$ | (17) |

Equivalently, ${\mathbf{x}}_{t}-{\mathbf{c}}_{j}$ is the OU process that follows (4). The marginal distribution of ${\mathbf{x}}_{t}$, conditional on ${\mathbf{x}}_{0}$ and $D({\mathbf{x}}_{0})=j$, is then $\mathcal{N}(\alpha_{t}({\mathbf{x}}_{0}-{\mathbf{c}}_{j})+{\mathbf{c}}_{j},\sigma_{t}^{2}\bm{I})$. At the time horizon $T$, the conditional distribution is $\mathcal{N}(\alpha_{T}({\mathbf{x}}_{0}-{\mathbf{c}}_{j})+{\mathbf{c}}_{j},\sigma_{T}^{2}\bm{I})$, which can be approximated by $\mathcal{N}({\mathbf{c}}_{j},\sigma_{T}^{2}\bm{I})$ if $\alpha_{T}$ is small enough. Hence, the unconditional prior distribution for the mixSGM is chosen to be $\sum_{i=1}^{K}p_{i}\mathcal{N}({\mathbf{c}}_{i},\sigma_{T}^{2}\bm{I})$, where $p_{i}$ is the proportion of data assigned to the $i$-th center. The table below summarizes and compares the prior distributions of the classical and mixed diffusion models.

| Prior distribution | DDPM | SGM |
|---|---|---|
| Classical | $\mathcal{N}(\bm{0},\bm{I})$ | $\mathcal{N}(\bm{0},\sigma_{T}^{2}\bm{I})$ |
| Mixed (our model) | $\sum_{i=1}^{K}p_{i}\mathcal{N}({\mathbf{c}}_{i},\bm{I})$ | $\sum_{i=1}^{K}p_{i}\mathcal{N}({\mathbf{c}}_{i},\sigma_{T}^{2}\bm{I})$ |

Again, we adopt the U-Net architecture to define the noise model ${\mathbf{\epsilon}}_{{\mathbf{\theta}}}:\mathbb{R}^{d}\times(0,T]\times\{1,2,\cdots,K\}\to\mathbb{R}^{d}$. Following (6), the training process solves the optimization problem $\underset{{\mathbf{\theta}}}{\min}\mathcal{L}_{\mathrm{mix}}^{\mathrm{SGM}}$ by stochastic gradient descent, where $\mathcal{L}_{\mathrm{mix}}^{\mathrm{SGM}}=\int_{0}^{T}\omega_{t}\cdot\mathbb{E}_{{\mathbf{x}}_{0},{\mathbf{\epsilon}}}\left[\|{\mathbf{\epsilon}}_{{\mathbf{\theta}}}({\mathbf{x}}_{t},t,j)-{\mathbf{\epsilon}}\|_{2}^{2}\right]dt$ with ${\mathbf{x}}_{t}=\alpha_{t}({\mathbf{x}}_{0}-{\mathbf{c}}_{j})+{\mathbf{c}}_{j}+\sigma_{t}{\mathbf{\epsilon}}$ and $j=D({\mathbf{x}}_{0})$.

Finally, the reverse sampling process can be modeled either as a reverse SDE or as a probability flow ODE. As in Section 4.1, the mixSGM first samples $j$ from $\{1,2,\cdots,K\}$ with $\mathbb{P}(j=i)=p_{i}$ and then samples $\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{j},\sigma_{T}^{2}\bm{I})$. The corresponding reverse SDE is given by

$\displaystyle d\tilde{{\mathbf{x}}}_{t}=\left(f_{t}(\tilde{{\mathbf{x}}}_{t}-{\mathbf{c}}_{j})+\frac{g_{t}^{2}}{\sigma_{t}}{\mathbf{\epsilon}}_{{\mathbf{\theta}}}(\tilde{{\mathbf{x}}}_{t},t,j)\right)dt+g_{t}d\tilde{{\mathbf{w}}}_{t}.$ | (18) |

For ease of exposition, we present the training and sampling processes for the mixSGM in Appendix A.2. We also present the following proposition to illustrate the effort-reduction effect of the mixSGM.

###### Proposition 2.

Define

$\mathrm{ReEff}^{\mathrm{SGM}}:=\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}(\bm{0},\sigma_{T}^{2}\bm{I}),\;{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}}\left[\|{\mathbf{x}}_{0}-\tilde{{\mathbf{x}}}_{T}\|^{2}\right],$ | (19) |

$\mathrm{ReEff}^{\mathrm{SGM}}_{\mathrm{mix}}:=\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{D({\mathbf{x}}_{0})},\sigma_{T}^{2}\bm{I})}\mathbb{E}_{{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}}\left[\|{\mathbf{x}}_{0}-\tilde{{\mathbf{x}}}_{T}\|^{2}\right],$ | (20) |

where $\bar{p}_{\mathrm{data}}$ is the empirical distribution over the given data. Given the cluster number $K$ and the cluster centers ${\mathbf{c}}_{1},\cdots,{\mathbf{c}}_{K}$, define $X_{i}=\{{\mathbf{x}}:D({\mathbf{x}})=i\}$ and $p_{i}=\frac{|X_{i}|}{\sum_{j=1}^{K}|X_{j}|}$ for $i=1,2,\cdots,K$. Under the assumption that ${\mathbf{c}}_{i}$ is the arithmetic mean of $X_{i}$, we have

$\mathrm{ReEff}^{\mathrm{SGM}}_{\mathrm{mix}}=\mathrm{ReEff}^{\mathrm{SGM}}-\sum_{i=1}^{K}p_{i}\|{\mathbf{c}}_{i}\|^{2}.$ | (21) |

Proposition 2 provides a quantitative measurement of the effort reduction brought by the mixSGM compared to the classical SGM. The amount of effort reduction reflects the amount of information provided by the structured prior distribution. One insight from Proposition 2 is that the effect of the effort reduction depends on $\sigma_{T}$, the standard deviation of the prior distribution in the SGM. When $\sigma_{T}$ is very large, the impact of the reduction term $\sum_{i=1}^{K}p_{i}\|{\mathbf{c}}_{i}\|^{2}$ is minimal because both reverse efforts, for the SGM and the mixSGM, become very large. In contrast, when $\sigma_{T}$ is moderate, the reduction effect becomes evident.

## 5 Numerical Experiments

### 5.1 Oakland Call Center & Public Works Service Requests Dataset

The Oakland Call Center & Public Works Service Requests Dataset is an open-source dataset containing service requests received by the Oakland Call Center. We preprocess the dataset to obtain the number of daily calls from July 1, 2009, to December 5, 2019. To learn the distribution of daily calls, we use the daily call counts from the 1,000th to the 2,279th day (1,280 days in total) since July 5, 2009, as the training data, and the counts from the 2,280th to the 2,919th day (640 days in total) as the testing data. Since operational datasets often exhibit non-stationarity (varying means and variances, and increasing or decreasing trends), we first run a linear regression on the training data to remove potential trends and then normalize the data. We compare DDPM with mixDDPM. As the training data is one-dimensional, we use fully connected neural networks with the same architecture and an equal number of neurons for both models. We train both models for 8k steps and independently generate 640 samples from each.

In Figure 3, we plot the density of the training data and of the data generated by DDPM and mixDDPM. We also report the $\mathcal{W}_{1}$ distances and the K-S statistics between the generated samples and the testing data in the table below. The benchmark column compares the training data to the testing data, serving as a measurement of the distributional distance between the training and testing data. Relative errors are calculated as the difference between each model's metric value and the benchmark's, expressed as a fraction of the benchmark's metric value.

Metric | Benchmark | DDPM | mixDDPM
---|---|---|---
$\mathcal{W}_{1}$ distance | 0.172 | 0.374 | 0.170
$\mathcal{W}_{1}$ relative error | | 1.174 | -0.012
K-S statistic | 0.112 | 0.277 | 0.105
K-S relative error | | 1.473 | -0.063
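Both metrics above can be computed directly from one-dimensional samples. A minimal numpy sketch with synthetic stand-in data follows (the Gaussian stand-ins are our own assumption; `scipy.stats.wasserstein_distance` and `scipy.stats.ks_2samp` provide equivalent functionality):

```python
import numpy as np

def w1_distance(x, y):
    """Empirical 1-D Wasserstein-1 distance for equal-size samples:
    the mean absolute difference of the sorted samples."""
    return np.abs(np.sort(x) - np.sort(y)).mean()

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical CDFs."""
    grid = np.sort(np.concatenate([x, y]))
    ecdf_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    ecdf_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.abs(ecdf_x - ecdf_y).max()

# Synthetic stand-ins for the 640 generated samples and 640 test-set values.
rng = np.random.default_rng(0)
generated = rng.normal(0.0, 1.0, 640)
testing = rng.normal(0.1, 1.1, 640)
print(w1_distance(generated, testing), ks_statistic(generated, testing))
```

In one dimension the $\mathcal{W}_1$ distance between two empirical distributions of equal size reduces to the average gap between order statistics, which is why sorting suffices.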

### 5.2 Experiments on EMNIST

In this section, we compare mixDDPM with DDPM on the EMNIST dataset Cohen et al. (2017), an extended version of MNIST that includes handwritten digits and characters of size $1\times 28\times 28$. We extract the first $N$ images of digits 0, 1, 2, and 3 to form the training dataset, with $N$ set to 64, 128, and 256. We use a U-Net architecture to learn the noise during training. As an illustrative example, we present the generated samples for $N=128$ in Figure 4 below.


When training resources are limited, i.e., the number of training steps is relatively small, mixDDPM performs better than DDPM. Specifically, when the training step count is 0.4k, approximately one-quarter of the images generated by DDPM are difficult to identify visually, whereas only 10$\%$ of the images generated by mixDDPM are hard to identify. As the training step count increases to 1.6k, the sample quality of both DDPM and mixDDPM becomes visually comparable. This observation suggests that mixDDPM significantly improves the visual quality of the samples compared to DDPM when training resources are constrained. More experimental results, including variations in the size of the training data and the number of training steps, can be found in Appendix C.2. In addition, experiments for SGM and mixSGM can be found in Appendix C.2.

### 5.3 Experiments on CIFAR10

We test our model on CIFAR10, a dataset of images with dimensions $3\times 32\times 32$. We extract the first 2,560 images from each of three categories: dog, cat, and truck. These 7,680 images are fixed as the training data. During training, we use the same model architecture and noise schedule within each pair (DDPM vs. mixDDPM, and SGM vs. mixSGM) to minimize the influence of other variables. We report the Fréchet Inception Distance (FID) Heusel et al. (2017) of the generated samples and the improvement ratio (Impr. Ratio) in Tables 5 and 6. The improvement ratio is calculated as the difference between the FID for DDPM/SGM and the FID for mixDDPM/mixSGM, expressed as a fraction of the FID for DDPM/SGM. Additionally, we provide a comparison of generated samples in Figure 5.

Model \ Training Steps | 180k | 240k | 300k | 360k | 420k | 480k | 540k | 600k
---|---|---|---|---|---|---|---|---
DDPM | 71.97 | 49.11 | 44.52 | 38.30 | 41.34 | 34.83 | 28.61 | 33.04
mixDDPM | 35.84 | 23.43 | 20.78 | 18.15 | 16.43 | 13.82 | 14.80 | 12.88
Impr. Ratio | 0.50 | 0.52 | 0.53 | 0.47 | 0.60 | 0.60 | 0.48 | 0.61

Model \ Training Steps | 180k | 240k | 300k | 360k | 420k | 480k | 540k | 600k
---|---|---|---|---|---|---|---|---
SGM | 65.82 | 45.55 | 49.94 | 35.22 | 34.88 | 24.58 | 28.42 | 20.46
mixSGM | 62.41 | 40.52 | 36.38 | 22.66 | 24.25 | 16.93 | 17.81 | 21.89
Impr. Ratio | 0.05 | 0.11 | 0.27 | 0.36 | 0.30 | 0.31 | 0.37 | -0.07

The results on CIFAR10 demonstrate that mixed diffusion models with Gaussian mixture priors generally achieve lower FID scores (on average roughly 55$\%$ lower for mixDDPM and 20$\%$ lower for mixSGM) and better sample quality. The reduced FID and improved sample quality can be attributed to the utilization of the structure of the data distribution: by identifying suitable centers, the reverse process can begin from these centers instead of the origin, reducing the effort required in the reverse process. Further implementation details and additional experimental results are provided in Appendix A.4 and Appendix C.3, respectively.

## 6 Extensions

As discussed in Section 4.2, the variance of each Gaussian component in the prior distribution of the mixSGM can be an arbitrary positive value, denoted by $\sigma_{T}^{2}$, and is not necessarily constrained to 1. In this section, we incorporate data-driven variance estimation for each component and provide numerical results demonstrating the improvements achieved by this estimation.

Given the number of components $K$, the parametric estimation of the prior distribution can be formalized as

$\underset{{\mathbf{c}}_{i},\sigma_{i}}{\min}\,\mathrm{ReEff}^{\mathrm{SGM}}_{\mathrm{mix}+\mathrm{var}}:=\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{D({\mathbf{x}}_{0})},\sigma_{D({\mathbf{x}}_{0})}^{2}\bm{I})}\,\mathbb{E}_{{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}}\left[\|{\mathbf{x}}_{0}-\tilde{{\mathbf{x}}}_{T}\|^{2}\right].$ | (22) |

Classical methods, including the Expectation-Maximization (EM) algorithm Dempster et al. (1977), can be applied to solve the optimization problem (22). In addition, we provide a simpler method to estimate the variances based on the dispatcher $D$. Specifically, we define

$\hat{\sigma}_{i}^{2}=\frac{1}{|\{{\mathbf{x}}:D({\mathbf{x}})=i\}|}\sum_{D({\mathbf{x}})=i}\frac{1}{d}\|{\mathbf{x}}-{\mathbf{c}}_{i}\|_{2}^{2}\text{ for }i=1,2,\cdots,K,$ | (23) |

where $d$ is the dimension of the state space.
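The estimator (23) is simply a per-cluster average squared deviation, normalized by the dimension. A minimal sketch with synthetic data and a pre-computed dispatcher assignment (the cluster parameters are our own illustrative assumptions):

```python
import numpy as np

def cluster_variances(data, labels, centers):
    """Dispatcher-based variance estimate of Eq. (23): for each cluster i,
    average ||x - c_i||^2 / d over the samples x with D(x) = i."""
    d = data.shape[1]
    return np.array([
        ((data[labels == i] - c) ** 2).sum(axis=1).mean() / d
        for i, c in enumerate(centers)
    ])

# Synthetic sanity check: two 8-D clusters with per-coordinate std 2.0 and 0.5,
# so the estimates should land near 4.0 and 0.25.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 2.0, (400, 8)), rng.normal(5.0, 0.5, (400, 8))])
labels = np.array([0] * 400 + [1] * 400)    # dispatcher output D(x)
centers = np.stack([data[labels == i].mean(axis=0) for i in range(2)])
print(cluster_variances(data, labels, centers))
```

The cost is a single pass over the data, which is what makes this estimator attractive in high-dimensional state spaces compared with running EM.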

With the estimated variances $\sigma_{i}$, the forward SDE for the model starting from a data sample ${\mathbf{x}}_{0}$ is given by

$d{\mathbf{x}}_{t}=f_{t}({\mathbf{x}}_{t}-{\mathbf{c}}_{j})dt+\sigma_{j}g_{t}d{\mathbf{w}}_{t},\quad j=D({\mathbf{x}}_{0}).$ | (24) |

Following the notations in Section 4.2, the training procedure is to solve the optimization problem:

$\underset{{\mathbf{\theta}}}{\min}\,\mathcal{L}_{\mathrm{mix+var}}^{\mathrm{SGM}}=\int_{0}^{T}\omega_{t}\cdot\mathbb{E}_{{\mathbf{x}}_{0},{\mathbf{\epsilon}}}\left[\|{\mathbf{\epsilon}}_{{\mathbf{\theta}}}({\mathbf{x}}_{t},t,j)-{\mathbf{\epsilon}}\|_{2}^{2}\right]dt,$ | (25) |

where ${\mathbf{x}}_{t}=\alpha_{t}{\mathbf{x}}_{0}+{\mathbf{c}}_{D({\mathbf{x}}_{0})}+\sigma_{t}\sigma_{D({\mathbf{x}}_{0})}{\mathbf{\epsilon}}$. The prior distribution can now be written as $\sum_{i=1}^{K}p_{i}\mathcal{N}({\mathbf{c}}_{i},\sigma_{i}^{2}\bm{I})$, where $p_{i}$ is the proportion of data assigned to the $i$-th center, as defined in Section 4. Moreover, the reverse SDE is given by

$\displaystyle d\tilde{{\mathbf{x}}}_{t}=\left(f_{t}(\tilde{{\mathbf{x}}}_{t}-{\mathbf{c}}_{j})+\frac{g_{t}^{2}\sigma_{j}}{\sigma_{t}}{\mathbf{\epsilon}}_{{\mathbf{\theta}}}(\tilde{{\mathbf{x}}}_{t},t,j)\right)dt+g_{t}\sigma_{j}d\tilde{{\mathbf{w}}}_{t}$ | (26) |

given that $\tilde{{\mathbf{x}}}_{T}$ comes from the $j$-th component, i.e., $\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{j},\sigma_{j}^{2}\bm{I})$. For ease of exposition, we abbreviate the above mixSGM with variance estimation as mixSGM+var. Following the same experimental settings as in Section 5.3, we compare the FID scores of the SGM, the mixSGM, and the mixSGM+var in Table 7 below. All improvement ratios (Impr. Ratio) are calculated with respect to the SGM.

Model \ Training Steps | 180k | 240k | 300k | 360k | 420k | 480k | 540k | 600k
---|---|---|---|---|---|---|---|---
SGM | 65.82 | 45.55 | 49.94 | 35.22 | 34.88 | 24.58 | 28.42 | 20.46
mixSGM | 62.41 | 40.52 | 36.38 | 22.66 | 24.25 | 16.93 | 17.81 | 21.89
Impr. Ratio (mixSGM) | 0.05 | 0.11 | 0.27 | 0.36 | 0.30 | 0.31 | 0.37 | -0.07
mixSGM+var | 51.22 | 36.17 | 29.58 | 22.17 | 18.05 | 16.65 | 15.73 | 13.09
Impr. Ratio (mixSGM+var) | 0.22 | 0.21 | 0.41 | 0.37 | 0.48 | 0.32 | 0.45 | 0.36

The results in Table 7 indicate that mixSGM+var consistently achieves lower FID scores compared to mixSGM. This finding further demonstrates the efficacy of the mixed diffusion model for image generation tasks, as the variance estimation method proposed in (23) requires minimal computation even in high-dimensional state spaces.

## 7 Conclusion

In this work, we propose and theoretically analyze a class of mixed diffusion models in which the prior distribution is chosen as a mixed Gaussian distribution. The goal is to allow users to flexibly incorporate structured information or domain knowledge about the data into the prior distribution. The proposed models show advantageous comparative performance, particularly when training resources are limited. For future work, we plan to extend the theoretical analysis and examine the performance of mixed diffusion models on data of different modalities.

## References

- Anderson (1982) Brian D. O. Anderson. Reverse-time diffusion equation models. *Stochastic Processes and their Applications*, 12(3):313–326, 1982.
- Arts et al. (2023) Marloes Arts, Victor Garcia Satorras, Chin-Wei Huang, Daniel Zugner, Marco Federici, Cecilia Clementi, Frank Noé, Robert Pinsler, and Rianne van den Berg. Two for one: Diffusion models and force fields for coarse-grained molecular dynamics. *Journal of Chemical Theory and Computation*, 19(18):6151–6159, 2023.
- Avrahami et al. (2022) Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 18208–18218, 2022.
- Bansal et al. (2024) Arpit Bansal, Eitan Borgnia, Hong-Min Chu, Jie Li, Hamid Kazemi, Furong Huang, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Cold diffusion: Inverting arbitrary image transforms without noise. *Advances in Neural Information Processing Systems*, 36, 2024.
- Berger & Zhou (2014) Vance W. Berger and YanYan Zhou. Kolmogorov–Smirnov test: Overview. *Wiley StatsRef: Statistics Reference Online*, 2014.
- Cohen et al. (2017) Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. EMNIST: Extending MNIST to handwritten letters. In *2017 International Joint Conference on Neural Networks (IJCNN)*, pp. 2921–2926. IEEE, 2017.
- Dempster et al. (1977) Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. *Journal of the Royal Statistical Society: Series B (Methodological)*, 39(1):1–22, 1977.
- Dhariwal & Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. *Advances in Neural Information Processing Systems*, 34:8780–8794, 2021.
- Fasano & Franceschini (1987) Giovanni Fasano and Alberto Franceschini. A multidimensional version of the Kolmogorov–Smirnov test. *Monthly Notices of the Royal Astronomical Society*, 225(1):155–170, 1987.
- Hang & Gu (2024) Tiankai Hang and Shuyang Gu. Improved noise schedule for diffusion training. *arXiv preprint arXiv:2407.03297*, 2024.
- Haxholli & Lorenzi (2023) Etrit Haxholli and Marco Lorenzi. Faster training of diffusion models and improved density estimation via parallel score matching, 2023. URL https://arxiv.org/abs/2306.02658.
- Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. *Advances in Neural Information Processing Systems*, 30, 2017.
- Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), *Advances in Neural Information Processing Systems*, volume 33, pp. 6840–6851. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf.
- Karras et al. (2022) Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. *Advances in Neural Information Processing Systems*, 35:26565–26577, 2022.
- Kaufman & Rousseeuw (2009) Leonard Kaufman and Peter J. Rousseeuw. *Finding Groups in Data: An Introduction to Cluster Analysis*. John Wiley & Sons, 2009.
- Kawar et al. (2023) Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 6007–6017, 2023.
- Kazerouni et al. (2023) Amirhossein Kazerouni, Ehsan Khodapanah Aghdam, Moein Heidari, Reza Azad, Mohsen Fayyaz, Ilker Hacihaliloglu, and Dorit Merhof. Diffusion models in medical imaging: A comprehensive survey. *Medical Image Analysis*, 88:102846, 2023.
- Khader et al. (2023) Firas Khader, Gustav Müller-Franzes, Soroosh Tayebi Arasteh, Tianyu Han, Christoph Haarburger, Maximilian Schulze-Hagen, Philipp Schad, Sandy Engelhardt, Bettina Baeßler, Sebastian Foersch, et al. Denoising diffusion probabilistic models for 3D medical image generation. *Scientific Reports*, 13(1):7303, 2023.
- Kim et al. (2023) Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, and Shinkook Choi. BK-SDM: Architecturally compressed stable diffusion for efficient text-to-image generation. In *Workshop on Efficient Systems for Foundation Models @ ICML 2023*, 2023.
- Kingma et al. (2021) Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. *Advances in Neural Information Processing Systems*, 34:21696–21707, 2021.
- Kong et al. (2020) Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. *arXiv preprint arXiv:2009.09761*, 2020.
- Leng et al. (2022) Yichong Leng, Zehua Chen, Junliang Guo, Haohe Liu, Jiawei Chen, Xu Tan, Danilo Mandic, Lei He, Xiangyang Li, Tao Qin, et al. BinauralGrad: A two-stage conditional diffusion probabilistic model for binaural audio synthesis. *Advances in Neural Information Processing Systems*, 35:23689–23700, 2022.
- Lu et al. (2022a) Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. *Advances in Neural Information Processing Systems*, 35:5775–5787, 2022a.
- Lu et al. (2022b) Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. *arXiv preprint arXiv:2211.01095*, 2022b.
- Massey Jr (1951) Frank J. Massey Jr. The Kolmogorov-Smirnov test for goodness of fit. *Journal of the American Statistical Association*, 46(253):68–78, 1951.
- Meng et al. (2021) Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. *arXiv preprint arXiv:2108.01073*, 2021.
- Mokady et al. (2023) Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 6038–6047, 2023.
- Nachmani et al. (2021) Eliya Nachmani, Robin San Roman, and Lior Wolf. Non Gaussian denoising diffusion models. *arXiv preprint arXiv:2106.07582*, 2021.
- Pandey et al. (2024) Kushagra Pandey, Maja Rudolph, and Stephan Mandt. Efficient integrators for diffusion generative models. In *The Twelfth International Conference on Learning Representations*, 2024. URL https://openreview.net/forum?id=qA4foxO5Gf.
- Pernias et al. (2024) Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, and Marc Aubreville. Würstchen: An efficient architecture for large-scale text-to-image diffusion models. In *The Twelfth International Conference on Learning Representations*, 2024. URL https://openreview.net/forum?id=gU58d5QeGv.
- Podell et al. (2024) Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In *The Twelfth International Conference on Learning Representations*, 2024. URL https://openreview.net/forum?id=di52zR8xgf.
- Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. *arXiv preprint arXiv:2204.06125*, 1(2):3, 2022.
- Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 10684–10695, 2022.
- Rousseeuw (1987) Peter J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. *Journal of Computational and Applied Mathematics*, 20:53–65, 1987.
- Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. *Advances in Neural Information Processing Systems*, 35:36479–36494, 2022.
- Song et al. (2020a) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. *arXiv preprint arXiv:2010.02502*, 2020a.
- Song & Ermon (2019) Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. *Advances in Neural Information Processing Systems*, 32, 2019.
- Song et al. (2020b) Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. *arXiv preprint arXiv:2011.13456*, 2020b.
- Tang & Zhao (2024) Wenpin Tang and Hanyang Zhao. Score-based diffusion models via stochastic differential equations: a technical tutorial. *arXiv preprint arXiv:2402.07487*, 2024.
- Wang et al. (2023) Zhendong Wang, Yifan Jiang, Huangjie Zheng, Peihao Wang, Pengcheng He, Zhangyang Wang, Weizhu Chen, and Mingyuan Zhou. Patch diffusion: Faster and more data-efficient training of diffusion models. In *Thirty-seventh Conference on Neural Information Processing Systems*, 2023. URL https://openreview.net/forum?id=iv2sTQtbst.
- Watson et al. (2022) Daniel Watson, William Chan, Jonathan Ho, and Mohammad Norouzi. Learning fast samplers for diffusion models by differentiating through sample quality. In *International Conference on Learning Representations*, 2022.
- Wu & Li (2023) Fang Wu and Stan Z. Li. DiffMD: a geometric diffusion model for molecular dynamics simulations. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 37, pp. 5321–5329, 2023.
- Xue et al. (2024) Shuchen Xue, Mingyang Yi, Weijian Luo, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhi-Ming Ma. SA-Solver: Stochastic Adams solver for fast sampling of diffusion models. *Advances in Neural Information Processing Systems*, 36, 2024.
- Yang et al. (2024) Shuai Yang, Yukang Chen, Luozhou Wang, Shu Liu, and Ying-Cong Chen. Denoising diffusion step-aware models. In *The Twelfth International Conference on Learning Representations*, 2024. URL https://openreview.net/forum?id=c43FGk8Pcg.
- Yen et al. (2023) Hao Yen, François G. Germain, Gordon Wichern, and Jonathan Le Roux. Cold diffusion for speech enhancement. In *ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, pp. 1–5. IEEE, 2023.
- Yu et al. (2024) Sihyun Yu, Weili Nie, De-An Huang, Boyi Li, Jinwoo Shin, and Anima Anandkumar. Efficient video diffusion models via content-frame motion-latent decomposition. In *The Twelfth International Conference on Learning Representations*, 2024. URL https://openreview.net/forum?id=dQVtTdsvZH.
- Zhang et al. (2023) Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In *Proceedings of the IEEE/CVF International Conference on Computer Vision*, pp. 3836–3847, 2023.
- Zhang & Chen (2023) Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. In *The Eleventh International Conference on Learning Representations*, 2023. URL https://openreview.net/forum?id=Loek7hfb46P.
- Zhao et al. (2024) Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. UniPC: A unified predictor-corrector framework for fast sampling of diffusion models. *Advances in Neural Information Processing Systems*, 36, 2024.
- Zheng et al. (2023) Hongkai Zheng, Weili Nie, Arash Vahdat, Kamyar Azizzadenesheli, and Anima Anandkumar. Fast sampling of diffusion models via operator learning. In *International Conference on Machine Learning*, pp. 42390–42402. PMLR, 2023.

## Appendix A Additional Implementation Details

### A.1 Center Selection Methods

In this section, we provide several center selection methods based on the analysis of the training data. As mentioned in Section 4, the specific method can be determined by users and does not need to be optimal.

- 1.
Data-driven clustering method. The data-driven clustering method first applies a traditional data-clustering technique Rousseeuw (1987); Kaufman & Rousseeuw (2009) to the samples from the data distribution. To be more specific, the method first calculates the average Silhouette's coefficient of the samples under different numbers of clusters. By maximizing the average Silhouette's coefficient over different values of $K$, the method determines an optimal value of $K$. Subsequently, the method applies the k-means algorithm to find the $K$ cluster centers of the data distribution, which we denote by ${\mathbf{c}}_{1},\cdots,{\mathbf{c}}_{K}\in\mathbb{R}^{d}$. We summarize the implementation of this method in Algorithm 3 below.

Input: Dataset $\mathcal{S}$, maximum $K$ value $K_{\max}$

Output: the number of clusters $K$ and the centers ${\mathbf{c}}_{1},\cdots,{\mathbf{c}}_{K}$

$SC\leftarrow[0]*(K_{\max}-1)$

for $k=2$ to $K_{\max}$ do

Apply the k-means algorithm to find $k$ cluster centers ${\mathbf{c}}_{1}^{k},\cdots,{\mathbf{c}}_{k}^{k}$

for ${\mathbf{x}}_{0}$ in $\mathcal{S}$ do

$SC[k-2]\mathrel{+}=\text{Silhouette's coefficient of }{\mathbf{x}}_{0}$

end for

end for

$K^{\prime}\leftarrow\underset{k}{\arg\max}\,SC[k]$

Return $K=K^{\prime}+2$ and ${\mathbf{c}}_{1}^{K},\cdots,{\mathbf{c}}_{K}^{K}$

- 2.
Data Labeling. When samples from the data distribution either have pre-given labels or can be labeled by a pre-trained classifier, the labels naturally separate the samples into groups. Hence, the mixed diffusion models can use the number of distinct labels and the centers of the samples sharing each label to determine the value of $K$ and ${\mathbf{c}}_{1},\cdots,{\mathbf{c}}_{K}$.

- 3.
Alternative methods. Alternatively, the number of clusters $K$ and the centers of the clusters ${\mathbf{c}}_{1},\cdots,{\mathbf{c}}_{K}\in\mathbb{R}^{d}$ can be seen as pre-given hyperparameters that are possibly specified by domain knowledge or other preliminary data analysis.
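The data-driven clustering method of Algorithm 3 can be sketched end-to-end as follows. This is a self-contained illustration: the minimal k-means with deterministic farthest-point initialization and the toy three-blob dataset are our own assumptions made for reproducibility; in practice, standard implementations such as scikit-learn's `KMeans` and `silhouette_score` would be used instead.

```python
import numpy as np

def farthest_point_init(data, k):
    """Deterministic greedy farthest-point initialization."""
    centers = [data[0]]
    for _ in range(k - 1):
        d2 = ((data[:, None] - np.stack(centers)[None]) ** 2).sum(-1).min(axis=1)
        centers.append(data[int(np.argmax(d2))])
    return np.stack(centers)

def kmeans(data, k, iters=50):
    """Minimal Lloyd's algorithm with deterministic initialization."""
    centers = farthest_point_init(data, k)
    for _ in range(iters):
        labels = ((data[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        centers = np.stack([data[labels == i].mean(axis=0) if (labels == i).any()
                            else centers[i] for i in range(k)])
    return centers, labels

def mean_silhouette(data, labels, k):
    """Average silhouette coefficient s(x) = (b - a) / max(a, b)."""
    dists = np.sqrt(((data[:, None] - data[None]) ** 2).sum(-1))
    scores = []
    for idx in range(len(data)):
        own = labels[idx]
        same = labels == own
        same[idx] = False
        if not same.any():
            continue
        a = dists[idx, same].mean()                 # mean intra-cluster distance
        others = [dists[idx, labels == j].mean()
                  for j in range(k) if j != own and (labels == j).any()]
        if not others:
            continue
        b = min(others)                             # nearest other cluster
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def select_k(data, k_max):
    """Algorithm 3: choose K by maximizing the average silhouette coefficient."""
    best_k = max(range(2, k_max + 1),
                 key=lambda k: mean_silhouette(data, kmeans(data, k)[1], k))
    return best_k, kmeans(data, best_k)[0]

rng = np.random.default_rng(0)
blobs = [rng.normal(c, 0.2, size=(60, 2)) for c in ([0, 0], [10, 0], [0, 10])]
data = np.vstack(blobs)
K, centers = select_k(data, k_max=6)
print(K)  # three well-separated blobs -> K = 3
```

For well-separated clusters, the silhouette score peaks at the true number of groups, since merging clusters inflates the intra-cluster distance $a$ while splitting one deflates the nearest-other-cluster distance $b$.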

### A.2 Algorithms for the mixSGM

Algorithms for the training and sampling processes of the mixSGM are shown in Algorithms 4 and 5 below.

Input: samples ${\mathbf{x}}_{0}$ from the data distribution, an untrained neural network ${\mathbf{\epsilon}}_{{\mathbf{\theta}}}$, time horizon $T$, scalar functions $f,g$, number of centers $K$, and the centers ${\mathbf{c}}_{1},\cdots,{\mathbf{c}}_{K}$

Output: Trained neural network ${\mathbf{\epsilon}}_{{\mathbf{\theta}}}$

Calculate $\alpha_{t}$ and $\sigma_{t}$ in closed-form

repeat

Get data ${\mathbf{x}}_{0}$

Find center $j=D({\mathbf{x}}_{0})$

Sample $t\sim U[0,T]$ and ${\mathbf{\epsilon}}\sim\mathcal{N}(\bm{0},\bm{I})$

${\mathbf{x}}_{t}\leftarrow\alpha_{t}{\mathbf{x}}_{0}+{\mathbf{c}}_{j}+\sigma_{t}{\mathbf{\epsilon}}$

$\mathcal{L}\leftarrow\omega_{t}\left\|{\mathbf{\epsilon}}-{\mathbf{\epsilon}}_{{\mathbf{\theta}}}({\mathbf{x}}_{t},t,j)\right\|_{2}^{2}$

Take a gradient descent step on $\nabla_{{\mathbf{\theta}}}\mathcal{L}$

until Converged or training resource/time limit is hit

Input: Trained neural network ${\mathbf{\epsilon}}_{{\mathbf{\theta}}}$, center weights $p_{1},\cdots,p_{K}$, centers ${\mathbf{c}}_{1},\cdots,{\mathbf{c}}_{K}$

Sample $j\in\{1,\cdots,K\}$ with $\mathbb{P}(j=i)=p_{i}$ for $i=1,\cdots,K$

Sample $\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{j},\sigma_{T}^{2}\bm{I})$

Apply numerical solvers to the reverse SDE (18).

Return $\tilde{{\mathbf{x}}}_{0}$
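The inner loop of Algorithm 4 can be sketched numerically. The following is a schematic illustration under toy assumptions of our own: two hand-picked centers, a nearest-center dispatcher, and a placeholder zero-predicting function standing in for the noise network $\epsilon_{\theta}$; the closed-form $\alpha_t,\sigma_t$ follow the VP-type forward SDE with the linear schedule $\beta_0=0.1$, $\beta_1=40$ used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative assumptions): data in R^4 with K = 2 fixed centers.
d, T = 4, 1.0
centers = np.array([[3.0] * d, [-3.0] * d])

def dispatcher(x0):
    """D(x_0): index of the nearest center."""
    return int(((centers - x0) ** 2).sum(axis=1).argmin())

def alpha_sigma(t, beta0=0.1, beta1=40.0):
    """Closed-form alpha_t, sigma_t for the VP-type forward SDE with the
    linear schedule beta_t = beta0 + (beta1 - beta0) * t."""
    integral = beta0 * t + 0.5 * (beta1 - beta0) * t ** 2   # int_0^t beta_s ds
    return np.exp(-0.5 * integral), np.sqrt(1.0 - np.exp(-integral))

def training_loss(x0, eps_model):
    """One loss evaluation of the Algorithm 4 loop (weight omega_t = 1)."""
    j = dispatcher(x0)
    t = rng.uniform(0.0, T)
    eps = rng.standard_normal(d)
    alpha_t, sigma_t = alpha_sigma(t)
    x_t = alpha_t * x0 + centers[j] + sigma_t * eps          # forward perturbation
    return float(((eps_model(x_t, t, j) - eps) ** 2).sum())  # ||eps - eps_theta||^2

# Placeholder "network" predicting zero noise, standing in for eps_theta.
loss = training_loss(rng.normal(3.0, 0.5, size=d), lambda x, t, j: np.zeros(d))
print(loss)
```

In an actual implementation, `eps_model` would be the U-Net conditioned on the center index $j$ via class embeddings, and a gradient step on the loss would follow.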

### A.3 Implementation Details on EMNIST

We apply the U-Net architecture to learn the noise during the training of both the original and the mixed diffusion models. The down-sampling path consists of three blocks with progressively increasing output channels: 32, 64, and 128. The third block incorporates attention mechanisms to capture global context. The up-sampling path mirrors the down-sampling structure, with the first block replaced by an attention block to refine spatial details. For the mixed diffusion models, we use class embeddings to incorporate the assignments from the dispatcher and employ the data-driven clustering method described in Algorithm 3. We use a batch size of 16.

For both DDPM and mixDDPM, we set the number of time steps to $T=1000$ and choose the noise schedule $\beta_{t}$ as a linear function of $t$, with $\beta_{1}=0.001$ and $\beta_{1000}=0.02$. For the SGM, we select the following forward SDE:

$d\mathbf{x}_{t}=-\frac{1}{2}\beta_{t}\mathbf{x}_{t}dt+\sqrt{\beta_{t}}d\mathbf{w}_{t}.$ | (27) |

For mixSGM, the forward SDE is defined as:

$d\mathbf{x}_{t}=-\frac{1}{2}\beta_{t}(\mathbf{x}_{t}-\mathbf{c}_{j})dt+\sqrt{\beta_{t}}d\mathbf{w}_{t},\quad\text{where }j=D(\mathbf{x}_{0}).$ | (28) |

Here, $\beta_{t}$ is chosen to be a linear function with $\beta_{0}=0.1$ and $\beta_{1}=40$ for both SGM and mixSGM. We use the DPM solver Lu et al. (2022a) for efficient sampling.
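For reference, the DDPM quantities implied by the linear schedule above can be computed directly. A short sketch; the center shift for mixDDPM noted in the comment is our reading of the construction in Section 4, not shown here.

```python
import numpy as np

# Linear DDPM schedule used above: beta_1 = 0.001, beta_1000 = 0.02, T = 1000.
T = 1000
betas = np.linspace(0.001, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)   # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

# Scales of the DDPM forward process
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
# (for mixDDPM, a center shift c_{D(x_0)} is added, cf. Section 4).
signal = np.sqrt(alphas_bar)
noise = np.sqrt(1.0 - alphas_bar)

print(signal[0], signal[-1])  # near 1 at t = 1, near 0 at t = T
```

The signal scale decays monotonically from nearly 1 to nearly 0, so by $t=T$ the forward process has essentially replaced the data with noise around the prior mean.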

### A.4 Implementation Details on CIFAR10

For both the original and the mixed diffusion models, we apply the U-Net architecture, which consists of a series of down-sampling and up-sampling blocks, each containing two layers. The down-sampling path has five blocks with progressively increasing output channels: 32, 64, 128, 256, and 512. The fourth block integrates attention mechanisms to capture global context. The up-sampling path mirrors the down-sampling structure, with the second block replaced by an attention block to refine spatial details. A dropout rate of 0.1 is applied to regularize the model. For the mixed diffusion models, the model utilizes class embeddings to incorporate the assignment provided by the dispatcher. Before training, we scale the training data to the range $[-2,2]$ and use a batch size of 128. We choose the weighting function $\omega_{t}$ in (11) to be 1, regardless of the time step. Since the images in CIFAR10 are already labeled, we adopt the data labeling method to determine the number of centers $K$ and the centers $\mathbf{c}_{1},\cdots,\mathbf{c}_{K}$.

For DDPM and mixDDPM, the noise schedules are set with $\beta_{1}=0.001$ and $\beta_{1000}=0.02$, following a linear schedule over 1000 steps. For SGM, we choose the forward SDE as:

$d\mathbf{x}_{t}=-\frac{1}{2}\beta_{t}\mathbf{x}_{t}dt+\sqrt{\beta_{t}}d\mathbf{w}_{t}.$ | (29) |

For mixSGM, we choose the forward SDE as:

$d\mathbf{x}_{t}=-\frac{1}{2}\beta_{t}(\mathbf{x}_{t}-\mathbf{c}_{j})dt+\sqrt{\beta_{t}}d\mathbf{w}_{t},\quad\text{where }j=D(\mathbf{x}_{0}).$ | (30) |

Here, $\beta_{t}$ is chosen as a linear function with $\beta_{0}=0.1$ and $\beta_{1}=40$ for both SGM and mixSGM. We set the batch size to 128 and apply the DPM solver Lu et al. (2022a) for efficient sampling.

## Appendix B Proof

###### Proof of Proposition 1.

We first calculate $\mathrm{Eff}^{\mathrm{DDPM}}$. Since ${\mathbf{x}}_{0}$ and $\tilde{{\mathbf{x}}}_{T}$ are independent, we have

$\displaystyle\mathrm{Eff}^{\mathrm{DDPM}}=\mathbb{E}_{{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}}\left[\|{\mathbf{x}}_{0}\|^{2}\right]+\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}(\bm{0},\bm{I})}\left[\|\tilde{{\mathbf{x}}}_{T}\|^{2}\right]-2\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}(\bm{0},\bm{I}),\;{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}}\left[{\mathbf{x}}_{0}^{T}\tilde{{\mathbf{x}}}_{T}\right]=\mathbb{E}_{{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}}\left[\|{\mathbf{x}}_{0}\|^{2}\right]+d,$ | (31) |

where $d$ is the dimension of the state space. In contrast, we calculate $\mathrm{Eff}^{\mathrm{DDPM}}_{\mathrm{mix}}$ by first conditioning on the center assignment:

$\displaystyle\mathrm{Eff}^{\mathrm{DDPM}}_{\mathrm{mix}}=\mathbb{E}\left[\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{D({\mathbf{x}}_{0})},\bm{I})}\mathbb{E}_{{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}}\left[\|{\mathbf{x}}_{0}-\tilde{{\mathbf{x}}}_{T}\|^{2}\right]\Big{|}D({\mathbf{x}}_{0})=i\right]$

$\displaystyle=\sum_{i=1}^{K}p_{i}\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{i},\bm{I}),\;{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}|_{X_{i}}}\left[\|{\mathbf{x}}_{0}-\tilde{{\mathbf{x}}}_{T}\|^{2}\right]$

$\displaystyle=\sum_{i=1}^{K}p_{i}\Big{(}\mathbb{E}_{{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}|_{X_{i}}}\left[\|{\mathbf{x}}_{0}\|^{2}\right]+\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{i},\bm{I})}\left[\|\tilde{{\mathbf{x}}}_{T}\|^{2}\right]-2\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{i},\bm{I}),\;{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}|_{X_{i}}}\left[{\mathbf{x}}_{0}^{T}\tilde{{\mathbf{x}}}_{T}\right]\Big{)}$

$\displaystyle=\mathbb{E}_{{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}}\left[\|{\mathbf{x}}_{0}\|^{2}\right]+d+\sum_{i=1}^{K}p_{i}\|{\mathbf{c}}_{i}\|^{2}-2\sum_{i=1}^{K}p_{i}\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{i},\bm{I}),\;{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}|_{X_{i}}}\left[{\mathbf{x}}_{0}^{T}\tilde{{\mathbf{x}}}_{T}\right].$ | (32) |

Since ${\mathbf{x}}_{0}$ and $\tilde{{\mathbf{x}}}_{T}$ are independent conditioned on the subspace $X_{i}$, we obtain

$\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{i},\bm{I}),\;{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}|_{X_{i}}}\left[{\mathbf{x}}_{0}^{T}\tilde{{\mathbf{x}}}_{T}\right]=\mathbb{E}_{{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}|_{X_{i}}}\left[{\mathbf{x}}_{0}\right]^{T}\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{i},\bm{I})}\left[\tilde{{\mathbf{x}}}_{T}\right]={\mathbf{c}}_{i}^{T}{\mathbf{c}}_{i}=\|{\mathbf{c}}_{i}\|^{2}.$ | (33) |

Hence, the effort of the mixDDPM is given by

$$\mathrm{Eff}^{\mathrm{DDPM}}_{\mathrm{mix}}=\mathbb{E}_{{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}}\left[\|{\mathbf{x}}_{0}\|^{2}\right]+d-\sum_{i=1}^{K}p_{i}\|{\mathbf{c}}_{i}\|^{2}.\tag{34}$$

Combining (31) and (34), we finish the proof for (16).∎
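Equation (34) can also be checked numerically by Monte Carlo simulation. The sketch below is illustrative only: it assumes a hypothetical two-dimensional data distribution made of three unit-variance Gaussian clusters, with `centers`, the cluster assignment `labels` (playing the role of $D(\mathbf{x}_0)$), and the sample size all chosen for demonstration rather than taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, n = 2, 3, 200_000

# Hypothetical data: K well-separated unit-variance clusters with centers c_i.
centers = np.array([[5.0, 0.0], [-5.0, 0.0], [0.0, 5.0]])
labels = rng.integers(0, K, size=n)                 # role of D(x_0)
x0 = centers[labels] + rng.standard_normal((n, d))  # samples from p_data

# mixDDPM prior: x_T ~ N(c_{D(x_0)}, I), drawn around the assigned center,
# independently of x_0 given the cluster.
xT = centers[labels] + rng.standard_normal((n, d))

# Monte Carlo estimate of the effort E||x_0 - x_T||^2.
eff_mc = np.mean(np.sum((x0 - xT) ** 2, axis=1))

# Closed form (34): E||x_0||^2 + d - sum_i p_i ||c_i||^2.
p = np.bincount(labels, minlength=K) / n
eff_formula = np.mean(np.sum(x0 ** 2, axis=1)) + d - p @ np.sum(centers ** 2, axis=1)

print(eff_mc, eff_formula)  # the two agree up to Monte Carlo error
```

With well-separated clusters the effort no longer grows with $\|\mathbf{c}_i\|^2$, which is the quantitative advantage the proposition formalizes.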

###### Proof of Proposition 2.

To prove Proposition 2, we first calculate $\mathrm{Eff}^{\mathrm{SGM}}$. Since ${\mathbf{x}}_{0}$ and $\tilde{{\mathbf{x}}}_{T}$ are independent, we have

$$\begin{aligned}
\mathrm{Eff}^{\mathrm{SGM}}&=\mathbb{E}_{{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}}\left[\|{\mathbf{x}}_{0}\|^{2}\right]+\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}(\bm{0},\sigma_{T}^{2}\bm{I})}\left[\|\tilde{{\mathbf{x}}}_{T}\|^{2}\right]-2\,\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}(\bm{0},\sigma_{T}^{2}\bm{I}),\;{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}}\left[{\mathbf{x}}_{0}^{T}\tilde{{\mathbf{x}}}_{T}\right]\\
&=\mathbb{E}_{{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}}\left[\|{\mathbf{x}}_{0}\|^{2}\right]+\sigma_{T}^{2}d,
\end{aligned}\tag{35}$$

where $d$ is the dimension of the state space. In contrast, we calculate $\mathrm{Eff}^{\mathrm{SGM}}_{\mathrm{mix}}$ by first conditioning on the centers:

$$\begin{aligned}
\mathrm{Eff}^{\mathrm{SGM}}_{\mathrm{mix}}&=\mathbb{E}\left[\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{D({\mathbf{x}}_{0})},\sigma_{T}^{2}\bm{I})}\mathbb{E}_{{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}}\left[\|{\mathbf{x}}_{0}-\tilde{{\mathbf{x}}}_{T}\|^{2}\right]\,\Big|\,D({\mathbf{x}}_{0})=i\right]\\
&=\sum_{i=1}^{K}p_{i}\,\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{i},\sigma_{T}^{2}\bm{I}),\;{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}|_{X_{i}}}\left[\|{\mathbf{x}}_{0}-\tilde{{\mathbf{x}}}_{T}\|^{2}\right]\\
&=\sum_{i=1}^{K}p_{i}\Big(\mathbb{E}_{{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}|_{X_{i}}}\left[\|{\mathbf{x}}_{0}\|^{2}\right]+\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{i},\sigma_{T}^{2}\bm{I})}\left[\|\tilde{{\mathbf{x}}}_{T}\|^{2}\right]-2\,\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{i},\sigma_{T}^{2}\bm{I}),\;{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}|_{X_{i}}}\left[{\mathbf{x}}_{0}^{T}\tilde{{\mathbf{x}}}_{T}\right]\Big)\\
&=\mathbb{E}_{{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}}\left[\|{\mathbf{x}}_{0}\|^{2}\right]+\sigma_{T}^{2}d+\sum_{i=1}^{K}p_{i}\|{\mathbf{c}}_{i}\|^{2}-2\sum_{i=1}^{K}p_{i}\,\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{i},\sigma_{T}^{2}\bm{I}),\;{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}|_{X_{i}}}\left[{\mathbf{x}}_{0}^{T}\tilde{{\mathbf{x}}}_{T}\right].
\end{aligned}\tag{36}$$

Since ${\mathbf{x}}_{0}$ and $\tilde{{\mathbf{x}}}_{T}$ are independent conditioned on the subspace $X_{i}$, we obtain

$$\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{i},\sigma_{T}^{2}\bm{I}),\;{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}|_{X_{i}}}\left[{\mathbf{x}}_{0}^{T}\tilde{{\mathbf{x}}}_{T}\right]=\mathbb{E}_{{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}|_{X_{i}}}\left[{\mathbf{x}}_{0}\right]^{T}\mathbb{E}_{\tilde{{\mathbf{x}}}_{T}\sim\mathcal{N}({\mathbf{c}}_{i},\sigma_{T}^{2}\bm{I})}\left[\tilde{{\mathbf{x}}}_{T}\right]={\mathbf{c}}_{i}^{T}{\mathbf{c}}_{i}=\|{\mathbf{c}}_{i}\|^{2}.\tag{37}$$

Hence, the effort of the mixSGM is given by

$$\mathrm{Eff}^{\mathrm{SGM}}_{\mathrm{mix}}=\mathbb{E}_{{\mathbf{x}}_{0}\sim\bar{p}_{\mathrm{data}}}\left[\|{\mathbf{x}}_{0}\|^{2}\right]+\sigma_{T}^{2}d-\sum_{i=1}^{K}p_{i}\|{\mathbf{c}}_{i}\|^{2}.\tag{38}$$

Combining (35) and (38), we finish the proof for Proposition 2.∎

## Appendix C Additional Numerical Results

### C.1 Additional Experiment Results on Oakland Call Center Dataset

We present in this section the numerical results for SGM and mixSGM on the Oakland Call Center dataset. For this experiment, the number of training steps is 4k. Relative errors are measured with respect to the benchmark.

| | Benchmark | SGM | mixSGM |
|---|---|---|---|
| $\mathcal{W}_{1}$ distance | 0.172 | 0.400 | 0.228 |
| $\mathcal{W}_{1}$ relative error | — | 1.325 | 0.326 |
| K-S statistic | 0.112 | 0.189 | 0.144 |
| K-S relative error | — | 0.688 | 0.286 |
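The $\mathcal{W}_1$ distances and K-S statistics reported above can be estimated from empirical samples with standard scipy routines. The sketch below uses synthetic one-dimensional stand-ins; the arrays `real` and `model` are hypothetical placeholders, not the actual call-center data.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(1)
# Hypothetical stand-ins for held-out real samples and model-generated samples.
real = rng.normal(100.0, 15.0, size=2000)
model = rng.normal(103.0, 18.0, size=2000)

w1 = wasserstein_distance(real, model)  # empirical W1 (earth mover's) distance
ks = ks_2samp(real, model).statistic    # two-sample Kolmogorov-Smirnov statistic
print(w1, ks)
```

A relative error against a benchmark value then follows as, e.g., `(w1_model - w1_benchmark) / w1_benchmark`, which reproduces the pattern of the table (for instance, $(0.228-0.172)/0.172\approx 0.326$ for mixSGM).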

### C.2 Additional Experiment Results on EMNIST

(Figures: generated EMNIST samples, with panels comparing DDPM against mixDDPM and SGM against mixSGM; only the panel labels survive in this extraction.)