Deep Generative Models
Understand and create complex, unstructured data.
Autoregressive Models
- Chain rule based factorization is fully general
- Compact representation via conditional independence and/or neural parameterizations
Parameterize a model family
Without loss of generality, we can use the chain rule to factorize the joint distribution:
\[p(x_1,...,x_n) = p(x_1)p(x_2|x_1) \cdots p(x_n|x_1,...,x_{n-1})\]
Fully Visible Sigmoid Belief Network (FVSBN)
The conditional variables \(X_i |X_1,···,X_{i−1}\) are Bernoulli with parameters
\[\hat{x}_i = p(X_i=1| x_1, \cdots, x_{i-1}; \bf{\alpha}^i) = p(X_i|x_{<i};\bf{\alpha}^i) = \sigma (\alpha^i_0 + \sum^{i-1}_{j=1} \alpha^i_j x_j)\]
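A minimal numpy sketch of the FVSBN computations above (log-likelihood and ancestral sampling). The parameter layout, a bias vector `alpha0` and a lower-triangular weight matrix `alpha` with `alpha[i, j]` \(= \alpha^i_j\), is an illustrative assumption.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fvsbn_log_likelihood(x, alpha0, alpha):
    """log p(x) under an FVSBN; alpha0[i] = alpha^i_0, alpha[i, j] = alpha^i_j for j < i."""
    ll = 0.0
    for i in range(len(x)):
        x_hat_i = sigmoid(alpha0[i] + alpha[i, :i] @ x[:i])   # p(X_i = 1 | x_{<i})
        ll += x[i] * np.log(x_hat_i) + (1 - x[i]) * np.log(1 - x_hat_i)
    return ll

def fvsbn_sample(alpha0, alpha, rng=None):
    """Ancestral sampling: draw x_i ~ Bernoulli(x_hat_i) in order i = 1, ..., n."""
    rng = rng or np.random.default_rng()
    n = len(alpha0)
    x = np.zeros(n)
    for i in range(n):
        x_hat_i = sigmoid(alpha0[i] + alpha[i, :i] @ x[:i])
        x[i] = float(rng.random() < x_hat_i)
    return x
```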
Neural Autoregressive Density Estimation (NADE)
Improvement: use a one-layer neural network instead of logistic regression.
\[h_i = \sigma (A_i x_{<i} + c_i)\]
\[\hat{x}_i = p(x_i | x_1, \cdots, x_{i-1}; A_i,c_i, \bf{\alpha}_i, b_i) = \sigma (\bf{\alpha}_i h_i + b_i)\]
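A sketch of the NADE conditionals with the usual tied-weight parameterization (the slices \(A_i\) are columns of one shared matrix `W`), which lets all \(\hat{x}_i\) be computed in \(O(nH)\) time; the shapes below are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nade_conditionals(x, W, c, V, b):
    """Return x_hat_i = p(x_i = 1 | x_{<i}) for all i.

    Assumed shapes (tied weights):
      W: (H, n) shared input-to-hidden weights, A_i = W[:, :i]
      c: (H,)   hidden bias
      V: (n, H) per-output weights alpha_i = V[i]
      b: (n,)   output biases
    """
    n = x.shape[0]
    x_hat = np.zeros(n)
    a = c.copy()                      # running pre-activation A_i x_{<i} + c
    for i in range(n):
        h_i = sigmoid(a)              # h_i = sigma(A_i x_{<i} + c_i)
        x_hat[i] = sigmoid(V[i] @ h_i + b[i])
        a += W[:, i] * x[i]           # O(H) update instead of recomputing from scratch
    return x_hat
```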
Autoregressive models vs. autoencoders:
- A vanilla autoencoder is not a generative model: it does not define a distribution over x we can sample from to generate new data points.
- For autoregressive autoencoders: We need to make sure it corresponds
to a valid Bayesian Network (DAG structure), i.e., we need an
ordering.
- \(\hat{x}_1\) cannot depend on any input \(x\). Then at generation time we don’t need any input to get started.
- \(\hat{x}_2\) can only depend on \(x_1\)
Masked Autoencoder for Distribution Estimation (MADE)
Use masks to disallow certain paths (Germain et al., 2015). Suppose ordering is \(x_2, x_3, x_1\).
- The unit producing the parameters for \(p(x_2)\) is not allowed to depend on any input. Unit for \(p(x_3|x_2)\) only on \(x_2\). And so on...
- For each unit in a hidden layer, pick a random integer \(i\) in \([1, n − 1]\). That unit is allowed to depend only on the first \(i\) inputs (according to the chosen ordering).
- Add mask to preserve this invariant: connect to all units in previous layer with smaller or equal assigned number (strictly < in final layer)
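A numpy sketch of the mask construction just described: input and output degrees follow the chosen ordering, hidden-unit degrees are random integers in \([1, n-1]\), hidden masks use \(\leq\), and the final mask uses strict \(<\). The `(out_dim, in_dim)` mask layout and the helper name are assumptions.

```python
import numpy as np

def made_masks(n_inputs, hidden_sizes, ordering=None, seed=0):
    """Build 0/1 connectivity masks for a MADE-style MLP, one mask per weight matrix."""
    rng = np.random.default_rng(seed)
    if ordering is None:
        ordering = np.arange(1, n_inputs + 1)          # input/output degrees 1..n
    degrees = [np.asarray(ordering)]
    for h in hidden_sizes:
        # each hidden unit may depend on the first m(k) inputs, m(k) in [1, n-1]
        degrees.append(rng.integers(1, n_inputs, size=h))
    degrees.append(np.asarray(ordering))               # output unit i parameterizes p(x_i | x_<i)

    masks = []
    for layer, (d_in, d_out) in enumerate(zip(degrees[:-1], degrees[1:])):
        if layer < len(hidden_sizes):
            masks.append((d_out[:, None] >= d_in[None, :]).astype(np.float32))  # <= for hidden layers
        else:
            masks.append((d_out[:, None] > d_in[None, :]).astype(np.float32))   # strict < for final layer
    return masks
```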
Pros & Cons
Pros
- Easy to evaluate likelihoods
- Easy to train
Cons
- Requires an ordering
- Generation is sequential
- Cannot learn features in an unsupervised way
Search for model parameters
Kullback-Leibler divergence (KL-divergence)
\[D(p||q) = \sum_x p(x) \log \frac{p(x)}{q(x)}\]
- \(D(p||q) \geq 0\) for all \(p,q\) with equality if and only if \(p = q\).
- Asymmetric, \(D(p||q) \neq D(q||p)\).
- If data comes from \(p\) but we use a scheme optimized for \(q\), the divergence \(D_{KL}(p||q)\) is the number of extra bits needed on average.
Expected log-likelihood
\[D(P_{data}||P_{\theta}) = E_{x \backsim P_{data}} [ \log \frac{P_{data}(x)}{P_{\theta}(x)} ] \\ = E_{x \backsim P_{data}} [ \log P_{data}(x)] - E_{x \backsim P_{data}} [\log P_{\theta}(x) ]\]
The first term does not depend on \(P_{\theta}\), so minimizing the KL divergence is equivalent to maximizing the expected log-likelihood:
\[\arg \min_{P_{\theta}} D(P_{data} || P_{\theta}) = \arg \min_{P_{\theta}} - E_{x \backsim P_{data}} [\log P_{\theta} (x)] = \arg \max_{P_{\theta}} E_{x \backsim P_{data}} [\log P_{\theta} (x)]\]
Empirical log-likelihood:
\[E_{\mathscr{D}} [\log P_{\theta} (x)] = \frac{1}{|\mathscr{D}|} \sum_{x \in \mathscr{D}} \log P_{\theta} (x)\]
Maximum likelihood learning:
\[\max_{P_{\theta}} \frac{1}{|\mathscr{D}|} \sum_{x \in \mathscr{D}} \log P_{\theta} (x)\]
Main idea in Monte Carlo Estimation
- Express the quantity of interest as the expected value of a random variable.
\[E_{x \backsim P} [g(x)] = \sum_x g(x) P(x)\]
- Generate T samples \(x_1, . . . , x_T\) from the distribution \(P\) with respect to which the expectation was taken.
- Estimate the expected value from the samples using:
\[\hat{g} (x^1,...,x^{T}) \triangleq \frac{1}{T} \sum^{T}_{t=1} g(x^t)\] where \(x^1,...,x^T\) are independent samples from \(P\), \(\hat{g}\) is a random variable.
Properties of the Monte Carlo Estimate
Unbiased \[E_P [\hat{g}] = E_P [g(x)]\]
Convergence \[\hat{g} = \frac{1}{T} \sum^T_{t=1} g(x^t) \rightarrow E_P [g(x)] \quad \text{for} \quad T \rightarrow \infty\]
Variance \[V_P[\hat{g}] = V_P [\frac{1}{T} \sum^T_{t=1}g(x^t)] = \frac{V_P [g(x)]}{T}\] variance of the estimator can be reduced by increasing the number of samples
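A tiny numpy illustration of the estimator and its variance: estimating \(E_{x \backsim \mathscr{N}(0,1)}[x^2] = 1\) with increasing numbers of samples \(T\); the estimate tightens as \(T\) grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Quantity of interest: E_{x ~ N(0,1)}[g(x)] with g(x) = x**2 (true value 1.0).
g = lambda x: x ** 2

for T in (10, 100, 10_000):
    samples = rng.standard_normal(T)
    g_hat = g(samples).mean()                 # unbiased Monte Carlo estimate
    print(f"T={T:6d}  g_hat={g_hat:.4f}")     # spread around 1.0 shrinks as Var[g]/T
```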
Empirical Risk and Overfitting
- Empirical risk minimization can easily overfit the data
- Generalization: the model should generalize well to "never-seen" samples.
Bias-Variance trade off:
- If the hypothesis space is very limited, it might not be able to represent \(P_{data}\), even with unlimited data (bias)
- If we select a highly expressive hypothesis class, we might represent the data better (variance)
- When we have small amount of data, multiple models can fit well, or even better than the true model. Moreover, small perturbations on \(\mathscr{D}\) will result in very different estimates
Avoid overfitting
- Hard constraints, e.g. by selecting a less expressive model family
- Smaller neural networks with fewer parameters
- Weight sharing
- Soft preference for “simpler” models: Occam's Razor.
- Augment the objective function with regularization
- Evaluate generalization performance on a held-out validation set
Latent Variable Models
Mixture of Gaussians: a Shallow Latent Variable Model
Mixture of Gaussians. Bayes net: \(z \rightarrow x\).
- \(z \backsim \rm{Categorical}(1,...,K)\)
- \(p(x|z=k) = \mathscr{N}(\mu_k, \Sigma_k)\)
Generative process:
- Pick a mixture component \(k\) by sampling \(z\)
- Generate a data point by sampling from that Gaussian
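A short numpy sketch of this two-step generative process (ancestral sampling for the Bayes net \(z \rightarrow x\)); the argument shapes are illustrative assumptions.

```python
import numpy as np

def sample_mixture_of_gaussians(pi, mus, Sigmas, num_samples, rng=None):
    """Ancestral sampling for z -> x.

    pi:     (K,)      mixing weights p(z = k)
    mus:    (K, d)    component means
    Sigmas: (K, d, d) component covariances
    """
    rng = rng or np.random.default_rng()
    K, d = mus.shape
    xs = np.zeros((num_samples, d))
    for s in range(num_samples):
        k = rng.choice(K, p=pi)                               # pick a mixture component k ~ p(z)
        xs[s] = rng.multivariate_normal(mus[k], Sigmas[k])    # sample x ~ N(mu_k, Sigma_k)
    return xs
```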
Variational Autoencoder
A mixture of an infinite number of Gaussians:
- \(z \backsim \mathscr{N}(0,I)\)
- \(p(x|z) = \mathscr{N}(\mu_{\theta}(z),
\Sigma_{\theta} (z))\) where \(\mu_{\theta}, \Sigma_{\theta}\) are neural
networks
- \(\mu_{\theta} (z) = \sigma(Az + c) = (\sigma(a_1 z + c_1), \sigma(a_2 z + c_2)) = (\mu_1 (z), \mu_2 (z))\)
- \(\Sigma_{\theta} (z) = \rm{diag} (\exp (\sigma (B z + d)))\)
- \(\theta = (A,B,c,d)\)
- Even though \(p(x | z)\) is simple, the marginal \(p(x)\) is very complex/flexible
The latent variable \(z\) follows a multivariate Gaussian, and \(p(x|z)\) is a Gaussian whose parameters are output by a neural network. The whole latent variable model can therefore be viewed as a mixture of infinitely many Gaussians, with \(x\) governed by a different Gaussian in different regions:
\[p(x) = \int p(x|z)p(z)dz\]
The MLE objective is
\[\theta \leftarrow \arg \max_{\theta} \frac{1}{N} \sum_i \log (\int p_{\theta} (x_i|z)p(z)dz)\]
The integral above is intractable. We first rewrite it as an expectation, assuming \(z \backsim p(z|x_i)\):
\[\theta \leftarrow \arg \max_{\theta} \frac{1}{N} \sum_i E_{z \backsim p(z|x_i)} [ \log p_{\theta} (x_i|z)]\]
We do not need to evaluate \(p_{\theta} (x_i|z)\) for every \(z\): only a small fraction of \(z\) values could have produced \(x_i\), and their distribution is \(p(z|x_i)\), which we write as \(q_i(z)\). This distribution may be very complex (later it is parameterized by a neural network). \(q_i (z)\) cannot be used to compute \(\log p_{\theta}(x_i)\) exactly, but it yields a lower bound, i.e., variational inference.
Marginal Likelihood
Probability: given the parameters, predict the outcomes of observations yet to be made.
Likelihood: given some observed outcomes, estimate the parameters of the process that produced them.
In statistics, the marginal likelihood function, or integrated likelihood, is a likelihood function in which some parameter variables have been marginalized out. In Bayesian statistics it is also called the evidence or model evidence.
Evidence Lower Bound
The log-likelihood function for partially observed data is hard to compute:
\[\log (\sum_{z \in \mathscr{Z}} p_{\theta} (\bf{x}, \bf{z})) = \log (\sum_{z \in \mathscr{Z}} \frac{q(z)}{q(z)} p_{\theta} (\bf{x}, \bf{z})) = \log (E_{z \backsim q(z)} [ \frac{p_{\theta} (\bf{x}, \bf{z})}{q(z)} ])\]
- log() is a concave function: \(\log(px + (1 − p)x^{'}) \geq p \log(x) + (1 − p) \log(x^{'})\).
- Idea: use Jensen Inequality (for concave functions):\(\log E[y] \geq E[\log y]\):
\[\log (E_{z \backsim q(z)} [f(z)]) = \log (\sum_z q(z)f(z)) ≥ \sum_z q(z) \log f(z)\]
Choosing \(f(z)=\frac{p_{\theta}(\bf{x},\bf{z})}{q(z)}\)
\[\log (E_{z \backsim q(z)} [\frac{p_{\theta}(\bf{x}, \bf{z})}{q(z)}]) \geq E_{z \backsim q(z)} [\log (\frac{p_{\theta} (\bf{x}, \bf{z})}{q(z)})]\] Called the Evidence Lower Bound (ELBO).
Variational inference
Suppose \(q(z)\) is any probability distribution over the hidden variables. The evidence lower bound (ELBO) holds for any \(q\):
\[\begin{aligned} \log p(x;\theta) &\geq \sum_z q(z) \log (\frac{p_{\theta}(\bf{x},\bf{z})}{q(z)}) \\ &= \sum_z q(z) \log p_{\theta} (\bf{x},\bf{z}) - \sum_z q(z) \log q(z) \\ &=\sum_z q(z) \log p_{\theta} (\bf{x},\bf{z}) + H(q) \end{aligned}\]
Equality holds if \(q = p(z|x;\theta)\)
\[\log p(x;\theta) = \sum_z q(z) \log p(z,x;\theta) + H(q)\]
In general, \(\log p(x;\theta) = \rm{ELBO} + D_{KL} (q(z) || p(z|x;\theta))\); the closer \(q(z)\) is to the posterior \(p(z|x;\theta)\), the closer the ELBO is to the true log-likelihood.
\[\begin{aligned} \log p(x;\theta) &\geq \sum_z q(z;\phi) \log p(z,x;\theta) + H(q(z;\phi)) = \mathscr{L}(x;\theta,\phi) \\ \log p(x;\theta) &= \mathscr{L}(x;\theta,\phi) + D_{KL} (q(z;\phi)|| p(z|x;\theta)) \end{aligned}\]
or
\[\begin{aligned} \mathscr{L} = E_{z \backsim q_{\phi}(z|x)} [\log p_{\theta} (x|z) + \log p(z)] + H(q_{\phi}(z|x)) \end{aligned}\]
\(\mathscr{L}(x;\theta,\phi)\) is the ELBO; maximizing it pushes the lower bound up. Since the KL divergence is always non-negative and \(\log p(x_i)\) does not change as \(q_i (z)\) varies, \(\mathscr{L}(x;\theta,\phi)\) attains its maximum when \(q_i (z)\) approaches \(p(z|x_i)\); therefore we need to minimize \(D_{KL} (q(z;\phi)|| p(z|x;\theta))\).
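A minimal PyTorch sketch of a single-sample estimate of \(\mathscr{L} = E_{z \backsim q_{\phi}(z|x)} [\log p_{\theta} (x|z) + \log p(z)] + H(q_{\phi}(z|x))\), assuming a diagonal-Gaussian \(q_\phi\) with reparameterized sampling, a standard normal prior, and a Bernoulli decoder; the `encoder`/`decoder` interfaces are hypothetical.

```python
import math
import torch

def elbo_estimate(x, encoder, decoder):
    """Single-sample Monte Carlo estimate of L(x; theta, phi).

    Assumed (hypothetical) interfaces:
      encoder(x) -> (mu, log_var) of a diagonal Gaussian q_phi(z | x)
      decoder(z) -> Bernoulli logits of p_theta(x | z); x has entries in [0, 1]
    """
    mu, log_var = encoder(x)
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)          # reparameterized sample z ~ q_phi(z | x)

    # E_q[log p_theta(x | z)] for a Bernoulli decoder
    logits = decoder(z)
    log_px_given_z = -torch.nn.functional.binary_cross_entropy_with_logits(
        logits, x, reduction="sum")

    # log p(z) under the standard normal prior, plus the entropy H(q_phi(z | x))
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum()
    entropy_q = 0.5 * (1.0 + math.log(2 * math.pi) + log_var).sum()

    return log_px_given_z + log_pz + entropy_q    # L = E_q[log p(x|z) + log p(z)] + H(q)
```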
Normalizing Flow Models
Key idea behind flow models: Map simple distributions (easy to sample and evaluate densities) to complex distributions through an invertible transformation.
- Typically consider parameterized densities:
- Gaussian: \(X \backsim \mathscr{N} (\mu,\sigma)\) if \(p_x (x)=\frac{1}{\sigma \sqrt{2 \pi}} e^{-(x-\mu)^2 /2\sigma^2}\)
- Uniform: \(X \backsim \mathscr{U}(a,b)\) if \(p_x (x)=\frac{1}{b-a} 1[a \leq x \leq b]\)
- If \(X\) is a continuous random
vector, we can usually represent it using its joint probability
density function:
- Gaussian: if \(p_x (x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp (-\frac{1}{2} (x - \mu )^T \Sigma^{-1} (x - \mu))\)
Change of Variables formula
Change of variables (1D case): If \(X=f(Z)\) and \(f(·)\) is monotone with inverse \(Z=f^{−1}(X)=h(X)\), then: \[p_X (x) = p_Z (h(x)) |h^{'} (x)|\] Note that the ”shape” of \(p_X(x)\) is different (more complex) from that of the prior \(p_Z(z)\). Letting \(z=h(x)=f^{−1}(x)\) we can also write \[p_X (x) = p_Z (z) \frac{1}{|f^{'} (z)|}\]
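A quick numpy check of the 1D formula with \(f(z) = e^z\) and \(Z \backsim \mathscr{N}(0,1)\), so \(h(x) = \log x\) and \(p_X(x) = p_Z(\log x)/x\) (the log-normal density): the probability of an interval obtained by integrating \(p_X\) matches the Monte Carlo estimate from samples of \(X = f(Z)\).

```python
import numpy as np

rng = np.random.default_rng(0)

# f(z) = exp(z), z ~ N(0, 1)  =>  h(x) = log(x), |h'(x)| = 1/x
p_Z = lambda z: np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)
p_X = lambda x: p_Z(np.log(x)) / x            # p_X(x) = p_Z(h(x)) |h'(x)|

# Compare P(1 <= X <= 3) estimated from samples against integrating p_X.
x = np.exp(rng.standard_normal(1_000_000))    # samples of X = f(Z)
mc_prob = np.mean((x >= 1.0) & (x <= 3.0))

grid = np.linspace(1.0, 3.0, 10_001)
int_prob = np.trapz(p_X(grid), grid)

print(mc_prob, int_prob)                      # both approximately 0.364
```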
Generalized change of variables
- For linear transformations specified via \(A\), change in volume is given by the determinant of \(A\)
- For non-linear transformations \(f(·)\), the linearized change in volume is given by the determinant of the Jacobian of \(f(·)\)
- Change of variables (General case): The mapping between \(Z\) and \(X\), given by \(f:{\Bbb{R}}^n \mapsto \Bbb{R}^n\), is invertible such that \(X = f(Z)\) and \(Z = f^{−1}(X)\). \[p_{X}(x) = p_{Z} (f^{-1}(x)) |\det (\frac{\partial f^{-1} (x)}{\partial x})|\]
- \(x, z\) need to be continuous and have the same dimension.
- For any invertible matrix \(A\), \(\det(A^{−1}) = \det(A)^{−1}\) \[p_{X}(x) = p_{Z} (z) |\det (\frac{\partial f (z)}{\partial z})|^{-1}\]
In a normalizing flow model, the mapping between \(Z\) and \(X\), given by \(f_{\theta}:{\Bbb{R}}^n \mapsto \Bbb{R}^n\), is deterministic and invertible such that \(X = f_{\theta}(Z)\) and \(Z = f^{−1}_{\theta}(X)\). The marginal likelihood \(p(x)\) is given by \[p_{X}(x;\theta) = p_{Z} (f^{-1}_\theta(x)) |\det (\frac{\partial f_\theta^{-1} (x)}{\partial x})|\]
Normalizing: Change of variables gives a normalized density after applying an invertible transformation Flow: Invertible transformations can be composed with each other \[z_m = f^m_\theta \circ \cdots \circ f^1_\theta (z_0) = f^m_\theta (f^{m-1}_\theta (\cdots (f^1_\theta (z_0)))) \triangleq f_\theta (z_0)\]
- Start with a simple distribution for \(z_0\) (e.g., Gaussian)
- Apply a sequence of M invertible transformations to finally obtain \(x = z_M\)
- By change of variables \[p_{X}(x;\theta) = p_{Z} (f^{-1}_\theta(x)) \prod^M_{m=1} |\det (\frac{\partial ({f^m_\theta})^{-1} (z_m)}{\partial z_m})|\]
Learning and Inference
Learning via maximum likelihood over the dataset \(\mathscr{D}\) \[\max_\theta \log p_{X}(\mathscr{D};\theta) = \sum_{x \in \mathscr{D}} \log p_{Z} (f^{-1}_\theta(x)) + \log |\det (\frac{\partial f^{-1}_\theta (x)}{\partial x})|\] Exact likelihood evaluation via the inverse transformation \(x \mapsto z\) and the change of variables formula. Sampling via the forward transformation \(z \mapsto x\) \[z \backsim p_{Z} (z) \quad x = f_{\theta} (z)\] Latent representations inferred via the inverse transformation (no inference network required!) \[z = f^{-1}_{\theta} (x)\]
Computing likelihoods also requires the evaluation of determinants of n × n Jacobian matrices, where n is the data dimensionality: \(O(n^3)\) => Key idea: choose transformations so that the resulting Jacobian matrix has special structure. For example, the determinant of a triangular matrix is the product of the diagonal entries, i.e., an O(n) operation.
Invertible transformations with diagonal Jacobians
Nonlinear Independent Components Estimation (NICE)
- Additive coupling layers: partition the variables \(z\) into two disjoint subsets, say \(z_{1:d}\) and \(z_{d+1:n}\), for any \(1 ≤ d < n\)
- Forward mapping \(z \mapsto x\)
- \(x_{1:d} = z_{1:d}\) (identity transformation)
- \(x_{d+1:n} = z_{d+1:n} + m_\theta (z_{1:d})\) (\(m_\theta (\cdot)\) is a neural network with parameters \(\theta\), \(d\) input units, and \(n − d\) output units)
- Inverse mapping \(x \mapsto z\)
- \(z_{1:d} = x_{1:d}\) (identity transformation)
- \(z_{d+1:n} = x_{d+1:n} - m_\theta (x_{1:d})\)
- Jacobian of forward mapping \[ J = \frac{\partial x}{\partial z} = \begin{pmatrix} I_d & 0 \\ \frac{\partial x_{d+1:n}}{\partial z_{1:d}} & I_{n-d} \end{pmatrix}, \quad \det (J) = 1\]
- Volume preserving transformation since determinant is 1.
- Rescaling layers
- Additive coupling layers are composed together (with arbitrary partitions of variables in each layer)
- Final layer of NICE applies a rescaling transformation
- Forward mapping \(z \mapsto x\): \[ x_i = s_i z_i\] where \(s_i > 0\) is the scaling factor for the i-th dimension.
- Inverse mapping \(x \mapsto z\): \[ z_i = \frac{x_i}{s_i}\]
- Jacobian of forward mapping: \[ J = \rm{diag} (s) \\ \det (J) = \prod^n_{i=1} s_i\]
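A numpy sketch of a NICE additive coupling layer and the final rescaling layer; `m_theta` stands for an arbitrary neural network with \(d\) inputs and \(n-d\) outputs (its implementation is not shown).

```python
import numpy as np

def additive_coupling_forward(z, m_theta, d):
    """NICE additive coupling layer, forward z -> x; log|det J| = 0 (volume preserving)."""
    x = z.copy()
    x[d:] = z[d:] + m_theta(z[:d])
    return x

def additive_coupling_inverse(x, m_theta, d):
    """Inverse x -> z: subtract the same shift computed from the untouched coordinates."""
    z = x.copy()
    z[d:] = x[d:] - m_theta(x[:d])
    return z

def rescaling_forward(z, s):
    """Final NICE rescaling layer: x_i = s_i z_i, with log|det J| = sum_i log s_i."""
    return s * z, np.sum(np.log(s))
```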
Real-NVP: Non-volume preserving extension of NICE
- Forward mapping \(z \mapsto x\)
- \(x_{1:d} = z_{1:d}\) (identity transformation)
- \(x_{d+1:n} = z_{d+1:n} \odot \exp (\alpha_\theta (z_{1:d})) + \mu_{\theta} (z_{1:d})\)
- \(\mu_\theta (\cdot), \alpha_\theta (\cdot)\) are both neural networks with parameters \(\theta\), \(d\) input units, and \(n − d\) output units; \(\odot\) denotes elementwise product
- Inverse mapping \(x \mapsto z\)
- \(z_{1:d} = x_{1:d}\) (identity transformation)
- \(z_{d+1:n} = (x_{d+1:n} - \mu_\theta (x_{1:d})) \odot (\exp (-\alpha_\theta (x_{1:d})))\)
- Jacobian of forward mapping \[J = \frac{\partial x}{\partial z} = \begin{pmatrix} I_d & 0 \\ \frac{\partial x_{d+1:n}}{\partial z_{1:d}} & \rm{diag}(\exp (\alpha_{\theta} (z_{1:d}))) \end{pmatrix}, \quad \det (J) = \prod^n_{i=d+1} \exp (\alpha_\theta (z_{1:d})_i) = \exp (\sum^n_{i=d+1} \alpha_\theta (z_{1:d})_i)\]
- Non-volume preserving transformation in general since determinant can be less than or greater than 1
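The corresponding sketch for a Real-NVP affine coupling layer, returning the log-determinant alongside the forward pass; `mu_theta` and `alpha_theta` again stand for arbitrary networks with \(d\) inputs and \(n-d\) outputs.

```python
import numpy as np

def realnvp_forward(z, mu_theta, alpha_theta, d):
    """Affine coupling layer, forward z -> x, also returning log|det J|."""
    alpha = alpha_theta(z[:d])
    x = z.copy()
    x[d:] = z[d:] * np.exp(alpha) + mu_theta(z[:d])
    return x, np.sum(alpha)                  # log|det J| = sum_i alpha_theta(z_{1:d})_i

def realnvp_inverse(x, mu_theta, alpha_theta, d):
    """Inverse x -> z: undo the shift, then the scale, using only x_{1:d}."""
    alpha = alpha_theta(x[:d])
    z = x.copy()
    z[d:] = (x[d:] - mu_theta(x[:d])) * np.exp(-alpha)
    return z
```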
Autoregressive Models as Normalizing Flow Models
- Consider a Gaussian autoregressive model: \[ p(x) = \prod^n_{i=1} p(x_i | x_{<i})\] such that \(p(x_i|x_{<i})=\mathscr{N} (\mu_i (x_1,...,x_{i-1}), \exp (\alpha_i (x_1,...,x_{i-1}))^2)\). Here \(\mu_i (\cdot), \alpha_i (\cdot)\) are neural networks for \(i > 1\) and constants for \(i = 1\).
- Sampler for this model:
- Sample \(z_i \backsim \mathscr{N} (0,1)\) for \(i=1,...,n\)
- Let \(x_1 = \exp (\alpha_1) z_1 + \mu_1\). Compute \(\mu_2(x_1),\alpha_2 (x_1)\)
- Let \(x_2 = \exp (\alpha_2) z_2 + \mu_2\). Compute \(\mu_3(x_1,x_2),\alpha_3 (x_1,x_2)\)
- Let \(x_3 = \exp (\alpha_3) z_3 + \mu_3\)...
- Flow interpretation: transforms samples from the standard Gaussian \((z_1,z_2,...,z_n)\) to those generated from the model \((x_1,x_2,...,x_n)\) via invertible transformations (parameterized by \(\mu_i (\cdot), \alpha_i (\cdot)\))
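A numpy sketch of this sampler, written explicitly as an invertible transformation of \(z \backsim \mathscr{N}(0, I)\); `mu_fns[i]` and `alpha_fns[i]` are assumed callables implementing \(\mu_i(\cdot)\) and \(\alpha_i(\cdot)\) (returning constants for \(i = 1\)).

```python
import numpy as np

def gaussian_ar_sample(mu_fns, alpha_fns, rng=None):
    """Sample from the Gaussian autoregressive model by transforming z ~ N(0, I).

    mu_fns[i], alpha_fns[i]: callables mapping x_{<i} (a length-i array) to scalars.
    """
    rng = rng or np.random.default_rng()
    n = len(mu_fns)
    z = rng.standard_normal(n)
    x = np.zeros(n)
    for i in range(n):
        mu_i = mu_fns[i](x[:i])
        alpha_i = alpha_fns[i](x[:i])
        x[i] = np.exp(alpha_i) * z[i] + mu_i   # invertible shift-and-scale of z_i
    return x
```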
Masked Autoregressive Flow (MAF)
- Forward mapping \(z \mapsto x\)
- Let \(x_1 = \exp (\alpha_1)z_1 + \mu_1\). Compute \(\mu_2(x_1), \alpha_2(x_1)\)
- Let \(x_2 = \exp (\alpha_2)z_2 + \mu_2\). Compute \(\mu_3(x_1, x_2), \alpha_3(x_1, x_2)\)
- Sampling is sequential and slow (like autoregressive): \(\mathscr{O}(n)\) time
- Inverse mapping from \(x \mapsto
z\):
- Compute all \(\mu_i, \alpha_i\) (can be done in parallel)
- Let \(z_1 = (x_1 - \mu_1) / \exp (\alpha_1)\) (scale and shift)
- Let \(z_2 = (x_2 - \mu_2) / \exp (\alpha_2)\)
- Let \(z_3 = (x_3 - \mu_3) / \exp (\alpha_3)\)...
- Jacobian is lower triangular, hence the determinant can be computed efficiently
- Likelihood evaluation is easy and parallelizable (the conditionals \(p(x_1),p(x_2|x_1),...,p(x_D|x_{1:D-1})\) can all be computed in parallel)
- Sampling from autoregressive models is slow because you must wait for all previous \(x_{1:i−1}\) to be computed before computing the new \(x_i\)
- Layers with different variable orderings can be stacked
Inverse Autoregressive Flow (IAF)
- Forward mapping \(z \mapsto x\)
(parallel):
- Sample \(z_i \backsim \mathscr{N} (0,1)\) for \(i=1,...,n\)
- Compute all \(\mu_i, \alpha_i\) (can be done in parallel)
- Let \(x_1 = \exp (\alpha_1) z_1 + \mu_1\)
- Let \(x_2 = \exp (\alpha_2) z_2 + \mu_2\)...
- Inverse mapping from \(x \mapsto
z\) (sequential):
- Let \(z_1 = (x_1 − \mu_1)/ \exp (\alpha_1)\). Compute \(\mu_2(z_1), \alpha_2(z_1)\)
- Let \(z_2 = (x_2 − \mu_2)/ \exp (\alpha_2)\). Compute \(\mu_3(z_1,z_2), \alpha_3(z_1,z_2)\)
- Fast to sample from, slow to evaluate likelihoods of data points (train)
- Note: Fast to evaluate likelihoods of a generated point (cache \(z_1,z_2,..., z_n\))
IAF vs. MAF
- Computational tradeoffs
- MAF: Fast likelihood evaluation, slow sampling
- IAF: Fast sampling, slow likelihood evaluation
- MAF more suited for training based on MLE, density estimation
- IAF more suited for real-time generation
Energy-Based Models (EBMs)
Parameterizing probability distributions
- non-negative: \(p(x) \geq 0\)
- \(g_\theta (\bf{x}) =f_\theta(\bf{x})^2\)
- \(g_\theta (\bf{x}) =\exp (f_\theta(\bf{x}))\)
- \(g_\theta (\bf{x}) =| f_\theta(\bf{x})|\)
- \(g_\theta (\bf{x}) = \log(1 + \exp ( f_\theta(\bf{x})))\)
- sum-to-one: \(\sum_x p(x) = 1\) (or \(\int p(x) dx = 1\) for continuous variables) \[p_\theta (\bf{x}) = \frac{1}{\rm{Volume} (g_\theta)} g_\theta (\bf{x}) = \frac{1}{\int g_\theta (\bf{x}) d\bf{x}} g_\theta (\bf{x})\]
- Example: choose \(g_\theta(\bf{x})\) so that we know the
volume analytically as a function of \(\theta\).
- \(g_{(\mu, \sigma)} (x) = e^{-\frac{(x-\mu)^2}{2\sigma^2}}\). Volume: \(\int e^{-\frac{(x-\mu)^2}{2\sigma^2}}dx=\sqrt{2 \pi \sigma^2}\). (Gaussian)
- \(g_\lambda (x) = e^{-\lambda x}\). Volume: \(\int^{+ \infty}_0 e^{- \lambda x} dx=\frac{1}{\lambda}\). (Exponential)
- \(g_\theta (x) = h(x) \exp \{ \theta \cdot
T(x)\}\). Volume: \(\exp \{ A
(\theta)\}\) where \(A(\theta)=\log
\int h(x) \exp \{\theta \cdot T(x) \} dx\). (Exponential family)
- Normal, Poisson, exponential, Bernoulli
- beta, gamma, Dirichlet, Wishart, etc.
Likelihood based learning
More complex models can be obtained by combining these building blocks.
Autoregressive: Products of normalized objects \(p_\theta (\bf{x}) p_{\theta^{'}(\bf{x})} (\bf{y})\): \[\int_x \int_y p_\theta (\bf{x}) p_{\theta^{'}(\bf{x})} (\bf{y}) d \bf{x} d\bf{y} = \int_x p_\theta (\bf{x}) \underbrace{\int_y p_{\theta^{'}} (\bf{y}) d \bf{y} }_{=1} d \bf{x}=\int_x p_{\theta} (\bf{x}) d\bf{x} = 1\]
Latent variables: Mixtures of normalized objects \(\alpha p_\theta (\bf{x}) + (1-\alpha) p_{\theta^{'}} (\bf{x})\) \[\int_x \alpha p_{\theta} (\bf{x}) + (1- \alpha) p_{\theta^{'}} (\bf{x}) d \bf{x} = \alpha + (1-\alpha) = 1\]
Energy-based model
\[p_\theta(\bf{x}) = \frac{1}{\int \exp (f_\theta (\bf{x}))d \bf{x}} \exp (f_\theta (\bf{x})) = \frac{1}{Z(\theta)} \exp (f_\theta (\bf{x})) \]
The volume/normalization constant (partition function)
\[Z(\theta) = \int \exp (f_\theta (\bf{x}))d \bf{x}\]
- \(-f_\theta(\bf{x})\) is called the energy
- Intuitively, configurations \(\bf{x}\) with low energy (high \(f_\theta(\bf{x})\)) are more likely.
Pros
- Very flexible model architectures
- Stable training
- Relatively high sample quality
- Flexible composition
Cons
- Sampling from \(p_\theta (\bf{x})\) is hard
- Evaluating and optimizing the likelihood \(p_\theta (\bf{x})\) is hard (learning is hard)
- No feature learning (but latent variables can be added)
Curse of dimensionality: The fundamental issue is that computing \(Z(\theta)\) numerically (when no analytic solution is available) scales exponentially in the number of dimensions of \(\bf{x}\).
Nevertheless, some tasks do not require knowing \(Z(\theta)\). Given a trained model, many applications require relative comparisons. Hence \(Z(\theta)\) is not needed. Comparing the probability of two points is easy: \(p_\theta (\bf{x}')/p_\theta (\bf{x}) = \exp (f_\theta (\bf{x}') - f_\theta (\bf{x}))\)
Sampling from energy-based models
- Maximum likelihood training: \(\max_\theta \{ f_\theta(\bf{x}_{\rm{train}}) - \log Z(\theta) \}\).
- Contrastive divergence: \[ \nabla_\theta f_\theta (\bf{x}_{\rm{train}}) - \nabla_\theta \log Z(\theta) \approx \nabla_\theta f_\theta (\bf{x}_{\rm{train}}) - \nabla_\theta f_\theta(\bf{x}_{\rm{sample}})\] where \(\bf{x}_{\rm{sample}} \backsim p_\theta (\bf{x})\).
- Unadjusted Langevin MCMC
- \(\bf{x}^0 \backsim \pi (\bf{x})\)
- Repeat for \(t=0,1,2,\cdots,T-1\)
- \(\bf{z}^t \backsim \mathscr{N} (0,I)\)
- \(\bf{x}^{t+1} = \bf{x}^t + \epsilon \nabla_x \log p_\theta (\bf{x})|_{\bf{x}=\bf{x}^t} + \sqrt{2 \epsilon} \bf{z}^t\)
- Properties:
- \(\bf{x}^{T}\) converges in distribution to \(p_\theta(\bf{x})\) when \(T \rightarrow \infty\) and \(\epsilon \rightarrow 0\).
- \(\nabla_x \log p_\theta (\bf{x}) = \nabla_x f_\theta (\bf{x})\) for continuous energy-based models.
- Convergence slows down as dimensionality grows.
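A minimal PyTorch sketch of unadjusted Langevin MCMC for an EBM \(p_\theta(\bf{x}) \propto \exp(f_\theta(\bf{x}))\), using \(\nabla_x \log p_\theta (\bf{x}) = \nabla_x f_\theta (\bf{x})\); `f_theta` is assumed to be any differentiable callable mapping a batch of points to scalar values.

```python
import torch

def langevin_sample(f_theta, x0, eps=1e-2, T=1000):
    """Unadjusted Langevin MCMC: x^{t+1} = x^t + eps * grad_x log p_theta(x^t) + sqrt(2 eps) z^t."""
    x = x0.clone().requires_grad_(True)
    for _ in range(T):
        # grad_x log p_theta(x) = grad_x f_theta(x); the partition function Z(theta) drops out
        grad = torch.autograd.grad(f_theta(x).sum(), x)[0]
        noise = torch.randn_like(x)
        x = (x + eps * grad + (2 * eps) ** 0.5 * noise).detach().requires_grad_(True)
    return x.detach()
```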
Training without sampling
Score Function
\[s_\theta (\bf{x}) := \nabla_x \log p_\theta (\bf{x}) = \nabla_x f_\theta (\bf{x}) - \underbrace{\nabla_x \log Z(\theta)}_{=0} = \nabla_x f_\theta (\bf{x}) \]
\(s_\theta (\bf{x})\) is independent of the partition function \(Z(\theta)\).
Score Matching
Minimizing the Fisher divergence between \(p_{data}(\bf{x})\) and the EBM \(p_\theta(\bf{x})\)
\[\frac{1}{2} E_{x \backsim p_{\rm{data}}} [|| \nabla_x \log p_{\rm{data}}(\bf{x}) - s_\theta (\bf{x}) ||^2_2] \\ = \frac{1}{2} E_{x \backsim p_{\rm{data}}} [|| \nabla_x \log p_{\rm{data}}(\bf{x}) - \nabla_x f_\theta (\bf{x}) ||^2_2]\]
- Sample a mini-batch of datapoints \(\{\bf{x}_1, \bf{x}_2,\cdots, \bf{x}_n \} \backsim p_{\rm{data}} (\bf{x})\).
- Estimate the score matching loss with the empirical mean \[\frac{1}{n} \sum^n_{i=1} [\frac{1}{2} || \nabla_x \log p_\theta (\bf{x}_i) ||^2_2 + \rm{tr} (\nabla^2_x \log p_\theta (\bf{x}_i)) ] \\ = \frac{1}{n} \sum^n_{i=1} [\frac{1}{2} || \nabla_x f_\theta (\bf{x}_i) ||^2_2 + \rm{tr} (\nabla^2_x f_\theta (\bf{x}_i)) ]\]
- Stochastic gradient descent.
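A PyTorch sketch of this empirical score matching loss, computing \(\nabla_x f_\theta\) and the exact Hessian trace with autograd; this is practical only for low-dimensional \(\bf{x}\) (denoising or sliced score matching are the usual scalable alternatives). `f_theta` is assumed to map a `(batch, dim)` tensor to `(batch,)` values.

```python
import torch

def score_matching_loss(f_theta, x):
    """Empirical score matching objective: mean over the batch of
    0.5 * ||grad_x f_theta(x)||^2 + tr(Hessian_x f_theta(x))."""
    x = x.clone().requires_grad_(True)
    score = torch.autograd.grad(f_theta(x).sum(), x, create_graph=True)[0]   # s_theta(x) = grad_x f_theta(x)

    trace = torch.zeros(x.shape[0], device=x.device)
    for i in range(x.shape[1]):                                              # diagonal of the Hessian, one dim at a time
        trace += torch.autograd.grad(score[:, i].sum(), x, create_graph=True)[0][:, i]

    return (0.5 * (score ** 2).sum(dim=1) + trace).mean()
```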
Summary
- Autoregressive models. \(p_\theta(x_1, x_2,\cdots, x_n) = \prod^n_{i=1} p_\theta(x_i | x_{<i})\)
- Normalizing flow models. \(p_\theta(\bf{x}) = p(\bf{z})| \det J_{f_\theta} (\bf{x})|\), where \(\bf{z} = f_\theta(\bf{x})\)
- Variational autoencoders: \(p_\theta(\bf{x}) = \int p(\bf{z}) p_{\theta} (\bf{x}|\bf{z}) d\bf{z}\)
- Energy-based models: \(p_\theta (\bf{x}) = \frac{e^{f_\theta (\bf{x})}}{Z(\theta)}\)
- Pros:
- Maximum likelihood training
- Principled model comparison via likelihoods
- Cons:
- Model architectures are restricted.
- Special architectures or surrogate losses to deal with intractable partition functions
- Generative Adversarial Networks (GANs).
- \(\min_G \max_D E_{x \backsim p_{\rm{data}}} [\log D(x)] + E_{z \backsim p(z)} [\log (1 - D(G(z)))]\)
- Pros:
- Two sample tests. Can optimize f-divergences and the Wasserstein distance.
- Very flexible model architectures.
- Samples typically have better quality
- Cons:
- Require adversarial training. Training instability and mode collapse.
- No principled way to compare different models
- No principled termination criteria for training