Deep Generative Models

Understand and create complex, unstructured data.

Autoregressive Models

  • Chain rule based factorization is fully general
  • Compact representation via conditional independence and/or neural parameterizations

Parameterize a model family

Without loss of generality, we can use chain rule for factorization

\[p(x_1,...,x_n) = p(x_1)p(x_2|x_1) \cdots p(x_n|x_1,...,x_{n-1})\]

Fully Visible Sigmoid Belief Network (FVSBN)

The conditional variables \(X_i |X_1,···,X_{i−1}\) are Bernoulli with parameters

\[\hat{x}_i = p(X_i=1| x_1, \cdots, x_{i-1}; \bf{\alpha}^i) = p(X_i|x_{<i};\bf{\alpha}^i) = \sigma (\alpha^i_0 + \sum^{i-1}_{j=1} \alpha^i_j x_j)\]

Neural Autoregressive Density Estimation (NADE)

Improvement: Use one layer neural network instead of logistic regression.

\[h_i = \sigma (A_i x_{<i} + c_i), \qquad \hat{x}_i = p(x_i | x_1, \cdots, x_{i-1}; A_i,c_i, \bf{\alpha}_i, b_i) = \sigma (\bf{\alpha}_i h_i + b_i)\]
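
A minimal NumPy sketch of this NADE-style computation, using the classic weight tying in which \(h_i\) depends on the first \(i-1\) columns of one shared matrix; the names `W, c, V, b` and all shapes are illustrative assumptions, not fixed by the equations above.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def nade_conditionals(x, W, c, V, b):
    """Compute x_hat_i = p(x_i = 1 | x_{<i}) for a binary vector x.

    Illustrative shapes: x (n,), W (h, n), c (h,), V (n, h), b (n,).
    Weight tying: h_i = sigma(W[:, :i] @ x[:i] + c), x_hat_i = sigma(V[i] @ h_i + b[i]).
    """
    n = x.shape[0]
    x_hat = np.zeros(n)
    a = c.copy()                      # running pre-activation c + W[:, :i] @ x[:i]
    for i in range(n):
        h_i = sigmoid(a)              # hidden units given x_{<i}
        x_hat[i] = sigmoid(V[i] @ h_i + b[i])
        a = a + W[:, i] * x[i]        # O(h) update instead of recomputing from scratch
    return x_hat

def log_likelihood(x, x_hat):
    # log p(x) = sum_i [ x_i log x_hat_i + (1 - x_i) log(1 - x_hat_i) ]
    return np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))
```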

Autoregressive models vs. autoencoders:

  • A vanilla autoencoder is not a generative model: it does not define a distribution over x we can sample from to generate new data points.
  • For autoregressive autoencoders: We need to make sure it corresponds to a valid Bayesian Network (DAG structure), i.e., we need an ordering.
    • \(\hat{x}_1\) cannot depend on any input \(x\). Then at generation time we don’t need any input to get started.
    • \(\hat{x}_2\) can only depend on \(x_1\)

Masked Autoencoder for Distribution Estimation (MADE)

[Figure: MADE masking scheme (MADE.png)]

Use masks to disallow certain paths (Germain et al., 2015). Suppose ordering is \(x_2, x_3, x_1\).

  1. The unit producing the parameters for \(p(x_2)\) is not allowed to depend on any input. Unit for \(p(x_3|x_2)\) only on \(x_2\). And so on...
  2. For each unit in a hidden layer, pick a random integer \(i\) in \([1, n − 1]\). That unit is allowed to depend only on the first \(i\) inputs (according to the chosen ordering).
  3. Add masks to preserve this invariant: connect each unit to all units in the previous layer with a smaller or equal assigned number (strictly smaller in the final layer), as in the sketch below
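
A minimal NumPy sketch of this mask construction (for the default ordering \(x_1, \ldots, x_n\) rather than the \(x_2, x_3, x_1\) example above); the layer sizes and variable names are illustrative.

```python
import numpy as np

def made_masks(n_inputs, hidden_sizes, rng=np.random.default_rng(0)):
    """Build binary masks for a MADE with the identity ordering.

    degrees[l][k] is the integer assigned to unit k of layer l: inputs get their
    position in the ordering (1..n), hidden units get a random integer in [1, n-1],
    and output units reuse the input degrees.
    """
    n = n_inputs
    degrees = [np.arange(1, n + 1)]                    # input degrees 1..n
    for h in hidden_sizes:
        degrees.append(rng.integers(1, n, size=h))     # random integer in [1, n-1]
    degrees.append(np.arange(1, n + 1))                # output degrees

    masks = []
    for l in range(len(degrees) - 1):
        d_in, d_out = degrees[l], degrees[l + 1]
        if l < len(degrees) - 2:
            # hidden layers: connect if degree_out >= degree_in
            masks.append((d_out[:, None] >= d_in[None, :]).astype(float))
        else:
            # output layer: strictly greater, so output i depends only on x_{<i}
            masks.append((d_out[:, None] > d_in[None, :]).astype(float))
    return masks  # masks[l] has shape (fan_out, fan_in); apply as W * masks[l]
```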

Pros & Cons

Pros

  • Easy to evaluate likelihoods
  • Easy to train

Cons

  • Requires an ordering
  • Generation is sequential
  • Cannot learn features in an unsupervised way

Search for model parameters

Kullback-Leibler divergence (KL-divergence)

\[D(p||q) = \sum_x p(x) \log \frac{p(x)}{q(x)}\]

  • \(D(p||q) \geq 0\) for all \(p,q\) with equality if and only if \(p = q\).
  • Asymmetric, \(D(p||q) \neq D(q||p)\).
  • If the data comes from \(p\) but we use a scheme optimized for \(q\), the divergence \(D_{KL}(p||q)\) is the number of extra bits needed on average.

Expected log-likelihood

\[D(P_{data}||P_{\theta}) = E_{x \backsim P_{data}} [ \log \frac{P_{data}(x)}{P_{\theta}(x)} ] \\ = E_{x \backsim P_{data}} [ \log P_{data}(x)] - E_{x \backsim P_{data}} [\log P_{\theta}(x) ]\]

The first term does not depend on \(P_{\theta}\), then minimizing KL divergence is equivalent to maximizing the expected log-likelihood

\[\arg \min_{P_{\theta}} D(P_{data} || P_{\theta}) = \arg \min_{P_{\theta}} - E_{x \backsim P_{data}} [\log P_{\theta} (x)] = \arg \max_{P_{\theta}} E_{x \backsim P_{data}} [\log P_{\theta} (x)]\]

Empirical log-likelihood:

\[E_{\mathscr{D}} [\log P_{\theta} (x)] = \frac{1}{|\mathscr{D}|} \sum_{x \in \mathscr{D}} \log P_{\theta} (x)\]

Maximum likelihood learning:

\[\max_{P_{\theta}} \frac{1}{|\mathscr{D}|} \sum_{x \in \mathscr{D}} \log P_{\theta} (x)\]

Main idea in Monte Carlo Estimation

  1. Express the quantity of interest as the expected value of a random variable.

\[E_{x \backsim P} [g(x)] = \sum_x g(x) P(x)\]

  2. Generate \(T\) samples \(x^1, \ldots, x^T\) from the distribution \(P\) with respect to which the expectation was taken.
  3. Estimate the expected value from the samples using:

\[\hat{g} (x^1,...,x^{T}) \triangleq \frac{1}{T} \sum^{T}_{t=1} g(x^t)\] where \(x^1,...,x^T\) are independent samples from \(P\), \(\hat{g}\) is a random variable.
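
A tiny NumPy illustration of the recipe above, estimating \(E_{x \backsim P}[g(x)]\) for an assumed \(P = \rm{Bernoulli}(0.3)\) and \(g(x) = x^2\) (true value 0.3); the distribution and \(T\) are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                  # illustrative choice: P = Bernoulli(0.3)
g = lambda x: x ** 2                     # quantity of interest; E_P[g(x)] = 0.3 exactly

T = 10_000
samples = rng.binomial(1, p, size=T)     # x^1, ..., x^T drawn i.i.d. from P
g_hat = g(samples).mean()                # unbiased estimator with variance Var_P[g]/T
print(g_hat)                             # close to 0.3, and closer as T grows
```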

Properties of the Monte Carlo Estimate

  • Unbiased \[E_P [\hat{g}] = E_P [g(x)]\]

  • Convergence \[\hat{g} = \frac{1}{T} \sum^T_{t=1} g(x^t) \rightarrow E_P [g(x)] \quad \text{for} \quad T \rightarrow \infty\]

  • Variance \[V_P[\hat{g}] = V_P [\frac{1}{T} \sum^T_{t=1}g(x^t)] = \frac{V_P [g(x)]}{T}\] The variance of the estimator can be reduced by increasing the number of samples.

Empirical Risk and Overfitting

  • Empirical risk minimization can easily overfit the data
  • Generalization: the model should generalize well to unseen ("never-seen") samples.

Bias-Variance trade off:

  • If the hypothesis space is very limited, it might not be able to represent \(P_{data}\), even with unlimited data (bias)
  • If we select a highly expressive hypothesis class, we might represent the data better (variance)
    • When we have a small amount of data, multiple models can fit the data well, or even better than the true model. Moreover, small perturbations of \(\mathscr{D}\) will result in very different estimates

Avoid overfitting

  • Hard constraints, e.g. by selecting a less expressive model family
    • Smaller neural networks with fewer parameters
    • Weight sharing
  • Soft preference for "simpler" models: Occam's Razor.
  • Augment the objective function with regularization
  • Evaluate generalization performance on a held-out validation set

Latent Variable Models

Mixture of Gaussians: a Shallow Latent Variable Model

Mixture of Gaussians. Bayes net: \(z \rightarrow x\).

  • \(z \backsim \rm{Categorical}(1,...,K)\)
  • \(p(x|z=k) = \mathscr{N}(\mu_k, \Sigma_k)\)

Generative process:

  1. Pick a mixture component \(k\) by sampling \(z\)
  2. Generate a data point by sampling from that Gaussian
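
A minimal NumPy sketch of this two-step (ancestral) sampling procedure; the mixture weights, means, and covariances below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
# illustrative 3-component mixture of Gaussians in 2D
pis    = np.array([0.5, 0.3, 0.2])                              # p(z = k)
mus    = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])          # mu_k
sigmas = np.array([np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)])  # Sigma_k

def sample_mog(n):
    z = rng.choice(len(pis), size=n, p=pis)   # 1. pick a mixture component by sampling z
    return np.stack([rng.multivariate_normal(mus[k], sigmas[k]) for k in z])  # 2. sample x | z

x = sample_mog(1000)
```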

Variational Autoencoder

A mixture of an infinite number of Gaussians:

  1. \(z \backsim \mathscr{N}(0,I)\)
  2. \(p(x|z) = \mathscr{N}(\mu_{\theta}(z), \Sigma_{\theta} (z))\) where \(\mu_{\theta}, \Sigma_{\theta}\) are neural networks
    • \(\mu_{\theta} (z) = \sigma (Az + c) = (\sigma(a_1 z + c_1), \sigma(a_2 z + c_2)) = (\mu_1 (z), \mu_2 (z))\)
    • \(\Sigma_{\theta} (z) = \rm{diag} (\exp (\sigma (B z + d))) = \left( \begin{matrix} \exp(\sigma(b_1 z + d_1)) & 0 \\ 0 & \exp(\sigma(b_2 z + d_2)) \end{matrix} \right)\)
    • \(\theta = (A,B,c,d)\)
  3. Even though \(p(x | z)\) is simple, the marginal \(p(x)\) is very complex/flexible

The latent variable \(z\) follows a multivariate Gaussian, and \(p(x|z)\) is a Gaussian whose parameters are produced by a neural network. The whole latent variable model can therefore be viewed as a mixture of infinitely many Gaussians, with \(x\) governed by a different Gaussian depending on where \(z\) falls:

\[p(x) = \int p(x|z)p(z)dz\]

The MLE objective is

\[\theta \leftarrow \arg \max_{\theta} \frac{1}{N} \sum_i \log (\int p_{\theta} (x_i|z)p(z)dz)\]

The integral above is intractable. We first rewrite it as an expectation, assuming \(z \backsim p(z|x_i)\):

\[\theta \leftarrow \arg \max_{\theta} \frac{1}{N} \sum_i E_{z \backsim p(z|x_i)} [ \log p_{\theta} (x_i|z)]\]

We do not need to evaluate \(p_{\theta} (x_i|z)\) for every \(z\): only a small subset of \(z\) values is likely to have produced \(x_i\), and those values are distributed according to \(p(z|x_i)\), which we denote \(q_i(z)\). This distribution may be very complex (later we parameterize it with a neural network). \(q_i (z)\) cannot be used to compute \(\log p_{\theta}(x_i)\) exactly, but it can be used to derive a lower bound on it, i.e., variational inference.

Marginal Likelihood

Probability: given the parameters, predict the outcomes of future observations.
Likelihood: given observed outcomes, estimate the parameters of the process that generated them.

In statistics, the marginal likelihood function (or integrated likelihood) is a likelihood function in which some parameter variables have been marginalized out. In Bayesian statistics it is also called the evidence or model evidence.

Evidence Lower Bound

Log-Likelihood function for Partially Observed Data is hard to compute:

\[\log (\sum_{z \in \mathscr{Z}} p_{\theta} (\bf{x}, \bf{z})) = \log (\sum_{z \in \mathscr{Z}} \frac{q(z)}{q(z)} p_{\theta} (\bf{x}, \bf{z})) = \log (E_{z \backsim q(z)} [ \frac{p_{\theta} (\bf{x}, \bf{z})}{q(z)} ])\]

  • log() is a concave function. \(\log(px + (1 − p)x^′) \geq p \log(x) + (1 − p) \log(x^′)\).
  • Idea: use Jensen Inequality (for concave functions):\(\log E[y] \geq E[\log y]\):

\[\log (E_{z \backsim q(z)} [f(z)]) = \log (\sum_z q(z)f(z)) ≥ \sum_z q(z) \log f(z)\]

Choosing \(f(z)=\frac{p_{\theta}(\bf{x},\bf{z})}{q(z)}\)

\[\log (E_{z \backsim q(z)} [\frac{p_{\theta}(\bf{x}, \bf{z})}{q(z)}]) ≥ E_{z \backsim q(z)} [\log (\frac{p_{\theta} (\bf{x}, \bf{z})}{q(z)})]\] Called the Evidence Lower Bound (ELBO).

Variational inference

Suppose \(q(z)\) is any probability distribution over the hidden variables, evidence lower bound (ELBO) holds for any \(q\)

\[\begin{aligned} \log p(x;\theta) &\geq \sum_z q(z) \log (\frac{p_{\theta}(\bf{x},\bf{z})}{q(z)}) \\ &= \sum_z q(z) \log p_{\theta} (\bf{x},\bf{z}) - \sum_z q(z) \log q(z) \\ &=\sum_z q(z) \log p_{\theta} (\bf{x},\bf{z}) + H(q) \end{aligned}\]

Equality holds if \(q = p(z|x;\theta)\)

\[\log p(x;\theta) = \sum_z q(z) \log p(z,x;\theta) + H(q)\]

In general, \(\log p(x;\theta) = \rm{ELBO} + D_{KL} (q(z) || p(z|x;\theta))\); the closer \(q(z)\) is to the posterior \(p(z|x;\theta)\), the closer the ELBO is to the true log-likelihood.

\[\begin{aligned} \log p(x;\theta) &\geq \sum_z q(z;\phi) \log p(z,x;\theta) + H(q(z;\phi)) \triangleq \mathscr{L}(x;\theta,\phi) \\ \log p(x;\theta) &= \mathscr{L}(x;\theta,\phi) + D_{KL} (q(z;\phi)|| p(z|x;\theta)) \end{aligned}\]

or

\[\begin{aligned} \mathscr{L} = E_{z \backsim q_{\phi}(z|x)} [\log p_{\theta} (x|z) + \log p(z)] + H(q_{\phi}(z|x)) \end{aligned}\]

\(\mathscr{L}(x;\theta,\phi)\) is the ELBO; maximizing it pushes the lower bound up. Since the KL divergence is always non-negative and \(\log p(x_i)\) does not change as \(q_i (z)\) varies, \(\mathscr{L}(x;\theta,\phi)\) attains its maximum when \(q_i (z)\) approaches \(p(z|x_i)\); hence we need to minimize \(D_{KL} (q(z;\phi)|| p(z|x;\theta))\).
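
A minimal PyTorch sketch of the ELBO in the equivalent form \(E_{z \backsim q_{\phi}(z|x)} [\log p_{\theta}(x|z)] - D_{KL}(q_{\phi}(z|x) || p(z))\), with a Gaussian encoder, standard Gaussian prior, and Bernoulli decoder, estimated via the reparameterization trick; the class name `TinyVAE`, the layer sizes, and the Bernoulli decoder are assumptions for illustration, not the only choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h), nn.ReLU())
        self.mu, self.logvar = nn.Linear(h, z_dim), nn.Linear(h, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h), nn.ReLU(), nn.Linear(h, x_dim))

    def elbo(self, x):
        # x: batch of inputs in [0, 1], shape (B, x_dim)
        hx = self.enc(x)
        mu, logvar = self.mu(hx), self.logvar(hx)                  # q_phi(z|x) = N(mu, diag(exp(logvar)))
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # reparameterization trick
        logits = self.dec(z)                                       # Bernoulli decoder p_theta(x|z)
        log_px_z = -F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)
        # KL(q_phi(z|x) || N(0, I)) in closed form for diagonal Gaussians
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1)
        return (log_px_z - kl).mean()
```

Training would then minimize `-model.elbo(x)` over mini-batches with any stochastic optimizer.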

Normalizing Flow Models

Key idea behind flow models: Map simple distributions (easy to sample and evaluate densities) to complex distributions through an invertible transformation.

  • Typically consider parameterized densities:
    • Gaussian: \(X \backsim \mathscr{N} (\mu,\sigma)\) if \(p_x (x)=\frac{1}{\sigma \sqrt{2 \pi}} e^{-(x-\mu)^2 /2\sigma^2}\)
    • Uniform: \(X \backsim \mathscr{U}(a,b)\) if \(p_x (x)=\frac{1}{b-a} 1[a \leq x \leq b]\)
  • If \(X\) is a continuous random vector, we can usually represent it using its joint probability density function:
    • Gaussian: if \(p_x (x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp (-\frac{1}{2} (x - \mu )^T \Sigma^{-1} (x - \mu))\)

Change of Variables formula

Change of variables (1D case): If \(X=f(Z)\) and \(f(·)\) is monotone with inverse \(Z=f^{−1}(X)=h(X)\), then: \[p_X (x) = p_Z (h(x)) |h^{'} (x)|\] Note that the "shape" of \(p_X(x)\) is different (more complex) from that of the prior \(p_Z(z)\). Letting \(z=h(x)=f^{−1}(x)\), we can also write \[p_X (x) = p_Z (z) \frac{1}{|f^{'} (z)|}\]
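
A quick numeric check of the 1D formula, using the assumed transformation \(X = f(Z) = \exp(Z)\) with \(Z \backsim \mathscr{N}(0,1)\), so that \(X\) is log-normal and the change-of-variables density can be compared against the known log-normal density.

```python
import numpy as np

x = 1.7
z = np.log(x)                                   # h(x) = f^{-1}(x) = log(x) for f(z) = exp(z)
p_z = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)    # standard normal density p_Z(z)
p_x_change_of_vars = p_z * abs(1.0 / x)         # p_X(x) = p_Z(h(x)) |h'(x)|, with h'(x) = 1/x
# reference: density of the standard log-normal, X = exp(Z) with Z ~ N(0, 1)
p_x_reference = np.exp(-np.log(x)**2 / 2) / (x * np.sqrt(2 * np.pi))
print(p_x_change_of_vars, p_x_reference)        # the two values agree
```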

Generalized change of variables

  • For linear transformations specified via \(A\), change in volume is given by the determinant of \(A\)
  • For non-linear transformations \(f(·)\), the linearized change in volume is given by the determinant of the Jacobian of \(f(·)\)
  • Change of variables (General case): The mapping between \(Z\) and \(X\), given by \(f:{\Bbb{R}}^n \mapsto \Bbb{R}^n\), is invertible such that \(X = f(Z)\) and \(Z = f^{−1}(X)\). \[p_{X}(x) = p_{Z} (f^{-1}(x)) |\det (\frac{\partial f^{-1} (x)}{\partial x})|\]
  • \(x, z\) need to be continuous and have the same dimension.
  • For any invertible matrix \(A\), \(\det(A^{−1}) = \det(A)^{−1}\) \[p_{X}(x) = p_{Z} (z) |\det (\frac{\partial f (z)}{\partial z})|^{-1}\]

In a normalizing flow model, the mapping between \(Z\) and \(X\), given by \(f_{\theta}:{\Bbb{R}}^n \mapsto \Bbb{R}^n\), is deterministic and invertible such that \(X= f_{\theta}(Z)\) and \(Z = f^{−1}_{\theta}(X)\). The marginal likelihood \(p(x)\) is given by \[p_{X}(x;\theta) = p_{Z} (f^{-1}_\theta(x)) |\det (\frac{\partial f_\theta^{-1} (x)}{\partial x})|\]

Normalizing: change of variables gives a normalized density after applying an invertible transformation. Flow: invertible transformations can be composed with each other: \[z_m = f^m_\theta \circ \cdots \circ f^1_\theta (z_0) = f^m_\theta (f^{m-1}_\theta (\cdots (f^1_\theta (z_0)))) \triangleq f_\theta (z_0)\]

  • Start with a simple distribution for \(z_0\) (e.g., Gaussian)
  • Apply a sequence of M invertible transformations to finally obtain \(x = z_M\)
  • By change of variables \[p_{X}(x;\theta) = p_{Z} (f^{-1}_\theta(x)) \prod^M_{m=1} |\det (\frac{\partial ({f^m_\theta})^{-1} (z_m)}{\partial z_m})|\]

Learning and Inference

Learning via maximum likelihood over the dataset \(\mathscr{D}\): \[\max_\theta \log p_{X}(\mathscr{D};\theta) = \sum_{x \in \mathscr{D}} \log p_{Z} (f^{-1}_\theta(x)) + \log |\det (\frac{\partial f^{-1}_\theta (x)}{\partial x})|\] Exact likelihood evaluation via the inverse transformation \(x \mapsto z\) and the change of variables formula. Sampling via the forward transformation \(z \mapsto x\): \[z \backsim p_{Z} (z) \quad x = f_{\theta} (z)\] Latent representations are inferred via the inverse transformation (no inference network required!): \[z = f^{-1}_{\theta} (x)\]

Computing likelihoods also requires evaluating determinants of n × n Jacobian matrices, where n is the data dimensionality, which costs \(O(n^3)\) in general. Key idea: choose transformations so that the resulting Jacobian matrix has special structure. For example, the determinant of a triangular matrix is the product of the diagonal entries, i.e., an \(O(n)\) operation.

Invertible transformations with diagonal Jacobians

Nonlinear Independent Components Estimation (NICE)

  1. Additive coupling layers: partition the variables \(z\) into two disjoint subsets, say \(z_{1:d}\) and \(z_{d+1:n}\), for any \(1 ≤ d < n\) (a minimal code sketch follows this list)
    • Forward mapping \(z \mapsto x\)
      • \(x_{1:d} = z_{1:d}\) (identity transformation)
      • \(x_{d+1:n} = z_{d+1:n} + m_\theta (z_{1:d})\) (\(m_\theta (\cdot)\) is a neural network with parameters \(\theta\), \(d\) input units, and \(n − d\) output units)
    • Inverse mapping \(x \mapsto z\)
      • \(z_{1:d} = x_{1:d}\) (identity transformation)
      • \(z_{d+1:n} = x_{d+1:n} - m_\theta (x_{1:d})\)
    • Jacobian of forward mapping \[ J = \frac{\partial x}{\partial z} = \left( \begin{matrix} I_d & 0 \\ \frac{\partial x_{d+1:n}}{\partial z_{1:d}} & I_{n-d} \end{matrix} \right), \quad \det (J) = 1\]
    • Volume preserving transformation since determinant is 1.
  2. Rescaling layers
    • Additive coupling layers are composed together (with arbitrary partitions of variables in each layer)
    • Final layer of NICE applies a rescaling transformation
    • Forward mapping \(z \mapsto x\): \[ x_i = s_i z_i\] where \(s_i > 0\) is the scaling factor for the i-th dimension.
    • Inverse mapping \(x \mapsto z\): \[ z_i = \frac{x_i}{s_i}\]
    • Jacobian of forward mapping: \[ J = \rm{diag} (s) \\ \det (J) = \prod^n_{i=1} s_i\]
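
A minimal NumPy sketch of one additive coupling layer as described in item 1; `m` below is an arbitrary fixed function standing in for the neural network \(m_\theta\), since the point is only the coupling structure, the exact inverse, and the unit Jacobian determinant.

```python
import numpy as np

d = 2                                          # split point: z_{1:d} | z_{d+1:n}
m = lambda z1: np.tanh(z1 @ np.ones((d, 3)))   # stand-in for m_theta: d inputs -> n - d outputs

def nice_forward(z):
    z1, z2 = z[:d], z[d:]
    return np.concatenate([z1, z2 + m(z1)])    # x_{1:d} = z_{1:d}, x_{d+1:n} = z_{d+1:n} + m(z_{1:d})

def nice_inverse(x):
    x1, x2 = x[:d], x[d:]
    return np.concatenate([x1, x2 - m(x1)])    # exact inverse; det(J) = 1, volume preserving

z = np.random.default_rng(0).normal(size=5)    # n = 5
assert np.allclose(nice_inverse(nice_forward(z)), z)
```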

Real-NVP: Non-volume preserving extension of NICE

  • Forward mapping \(z \mapsto x\)
    • \(x_{1:d} = z_{1:d}\) (identity transformation)
    • \(x_{d+1:n} = z_{d+1:n} \odot \exp (\alpha_\theta (z_{1:d})) + \mu_{\theta} (z_{1:d})\)
    • \(\mu_\theta (\cdot), \alpha_\theta (\cdot)\) are both neural networks with parameters \(\theta\), \(d\) input units, and \(n − d\) output units; \(\odot\) denotes the elementwise product
  • Inverse mapping \(x \mapsto z\)
    • \(z_{1:d} = x_{1:d}\) (identity transformation)
    • \(z_{d+1:n} = (x_{d+1:n} - \mu_\theta (x_{1:d})) \odot (\exp (-\alpha_\theta (x_{1:d})))\)
  • Jacobian of forward mapping \[J = \frac{\partial x}{\partial z} = \left( \begin{matrix} I_d & 0 \\ \frac{\partial x_{d+1:n}}{\partial z_{1:d}} & \rm{diag}(\exp (\alpha_{\theta} (z_{1:d}))) \end{matrix} \right), \quad \det (J) = \prod^n_{i=d+1} \exp (\alpha_\theta (z_{1:d})_i) = \exp (\sum^n_{i=d+1} \alpha_\theta (z_{1:d})_i)\]
  • Non-volume preserving transformation in general since determinant can be less than or greater than 1
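
The same kind of sketch for a Real-NVP affine coupling layer, now also returning the log-determinant \(\sum_i \alpha_\theta(z_{1:d})_i\); `mu` and `alpha` are arbitrary stand-ins for the networks \(\mu_\theta, \alpha_\theta\).

```python
import numpy as np

d = 2
mu    = lambda z1: 0.1 * (z1 @ np.ones((d, 3)))        # stand-in for mu_theta (d inputs -> n - d outputs)
alpha = lambda z1: np.tanh(z1 @ np.ones((d, 3)))       # stand-in for alpha_theta

def realnvp_forward(z):
    z1, z2 = z[:d], z[d:]
    x2 = z2 * np.exp(alpha(z1)) + mu(z1)               # elementwise scale-and-shift
    log_det = alpha(z1).sum()                          # log|det J| = sum_i alpha_theta(z_{1:d})_i
    return np.concatenate([z1, x2]), log_det

def realnvp_inverse(x):
    x1, x2 = x[:d], x[d:]
    return np.concatenate([x1, (x2 - mu(x1)) * np.exp(-alpha(x1))])

z = np.random.default_rng(1).normal(size=5)
x, log_det = realnvp_forward(z)
assert np.allclose(realnvp_inverse(x), z)
```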

Autoregressive Models as Normalizing Flow Models

  • Consider a Gaussian autoregressive model: \[ p(x) = \prod^n_{i=1} p(x_i | x_{<i})\] such that \(p(x_i|x_{<i})=\mathscr{N} (\mu_i (x_1,...,x_{i-1}), \exp (\alpha_i (x_1,...,x_{i-1}))^2)\).\(\mu_i (\cdot), \alpha_i (\cdot)\) are neural networks for \(i > 1\) and constants for \(i = 1\).
  • Sampler for this model:
    • Sample \(z_i \backsim \mathscr{N} (0,1)\) for \(i=1,...,n\)
    • Let \(x_1 = \exp (\alpha_1) z_1 + \mu_1\). Compute \(\mu_2(x_1),\alpha_2 (x_1)\)
    • Let \(x_2 = \exp (\alpha_2) z_2 + \mu_2\). Compute \(\mu_3(x_1,x_2),\alpha_3 (x_1,x_2)\)
    • Let \(x_3 = \exp (\alpha_3) z_3 + \mu_3\)...
  • Flow interpretation: transforms samples from the standard Gaussian \((z_1,z_2,...,z_n)\) to those generated from the model \((x_1,x_2,...,x_n)\) via invertible transformations (parameterized by \(\mu_i (\cdot), \alpha_i (\cdot)\))

Masked Autoregressive Flow (MAF)

[Figure: MAF forward mapping \(z \mapsto x\) (fwd.png)]

  • Forward mapping \(z \mapsto x\)
    • Let \(x_1 = \exp (\alpha_1)z_1 + \mu_1\). Compute \(\mu_2(x_1), \alpha_2(x_1)\)
    • Let \(x_2 = \exp (\alpha_2)z_2 + \mu_2\). Compute \(\mu_3(x_1, x_2), \alpha_3(x_1, x_2)\)
  • Sampling is sequential and slow (like autoregressive): \(\mathscr{O}(n)\) time

[Figure: MAF inverse mapping \(x \mapsto z\) (bwd.png)]

  • Inverse mapping from \(x \mapsto z\):
    • Compute all \(\mu_i, \alpha_i\) (can be done in parallel)
    • Let \(z_1 = (x_1 - \mu_1) / \exp (\alpha_1)\) (scale and shift)
    • Let \(z_2 = (x_2 - \mu_2) / \exp (\alpha_2)\)
    • Let \(z_3 = (x_3 - \mu_3) / \exp (\alpha_3)\)...
  • Jacobian is lower triangular, hence the determinant can be computed efficiently
  • Likelihood evaluation is easy and parallelizable (the terms \(p(x_1), p(x_2|x_1), \ldots, p(x_n|x_{<n})\) can be evaluated in parallel)
  • Sampling from autoregressive models is slow because you must wait for all previous \(x_{1:i−1}\) to be computed before computing the new \(x_i\) (compare the two passes in the sketch after this list)
  • Layers with different variable orderings can be stacked
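
A minimal NumPy sketch contrasting MAF's sequential sampling pass with its parallelizable inverse pass; `mu` and `alpha` are toy stand-ins for \(\mu_i(x_{<i}), \alpha_i(x_{<i})\) (in a real MAF a single masked network produces all of them at once).

```python
import numpy as np

n = 4
mu    = lambda x_prev: 0.5 * x_prev.sum() if x_prev.size else 0.0   # stand-in for mu_i(x_{<i})
alpha = lambda x_prev: 0.1 * x_prev.sum() if x_prev.size else 0.0   # stand-in for alpha_i(x_{<i})

def maf_sample(z):
    """Forward mapping z -> x: sequential, each x_i needs the previous x_{<i}."""
    x = np.zeros(n)
    for i in range(n):
        x[i] = np.exp(alpha(x[:i])) * z[i] + mu(x[:i])
    return x

def maf_inverse(x):
    """Inverse mapping x -> z: every z_i uses only the observed x_{<i}, so parallelizable."""
    return np.array([(x[i] - mu(x[:i])) / np.exp(alpha(x[:i])) for i in range(n)])

z = np.random.default_rng(0).normal(size=n)
assert np.allclose(maf_inverse(maf_sample(z)), z)
```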

Inverse Autoregressive Flow (IAF)

[Figure: IAF forward and inverse mappings (IAF.png)]

  • Forward mapping \(z \mapsto x\) (parallel):
    • Sample \(z_i \backsim \mathscr{N} (0,1)\) for \(i=1,...,n\)
    • Compute all \(\mu_i, \alpha_i\) (can be done in parallel)
    • Let \(x_1 = \exp (\alpha_1) z_1 + \mu_1\)
    • Let \(x_2 = \exp (\alpha_2) z_2 + \mu_2\)...
  • Inverse mapping from \(x \mapsto z\) (sequential):
    • Let \(z_1 = (x_1 − \mu_1)/ \exp (\alpha_1)\). Compute \(\mu_2(z_1), \alpha_2(z_1)\)
    • Let \(z_2 = (x_2 − \mu_2)/ \exp (\alpha_2)\). Compute \(\mu_3(z_1,z_2), \alpha_3(z_1,z_2)\)
  • Fast to sample from, slow to evaluate likelihoods of data points (which is what training requires)
  • Note: Fast to evaluate likelihoods of a generated point (cache \(z_1,z_2,..., z_n\))

IAF vs. MAF

  • Computational tradeoffs
    • MAF: Fast likelihood evaluation, slow sampling
    • IAF: Fast sampling, slow likelihood evaluation
  • MAF more suited for training based on MLE, density estimation
  • IAF more suited for real-time generation

Energy-Based Models (EBMs)

Parameterizing probability distributions

  1. non-negative: \(p(x) \geq 0\)
    • \(g_\theta (\bf{x}) =f_\theta(\bf{x})^2\)
    • \(g_\theta (\bf{x}) =\exp (f_\theta(\bf{x}))\)
    • \(g_\theta (\bf{x}) =| f_\theta(\bf{x})|\)
    • \(g_\theta (\bf{x}) = \log(1 + \exp ( f_\theta(\bf{x})))\)
  2. sum-to-one: \(\sum_x p(x) = 1\) (or \(\int p(x) dx = 1\) for continuous variables) \[p_\theta (\bf{x}) = \frac{1}{\rm{Volume} (g_\theta)} g_\theta (\bf{x}) = \frac{1}{\int g_\theta (\bf{x}) d\bf{x}} g_\theta (\bf{x})\]
  • Example: choose \(g_\theta(\bf{x})\) so that we know the volume analytically as a function of \(\theta\).
    • \(g_{(\mu, \sigma)} (x) = e^{-\frac{(x-\mu)^2}{2\sigma^2}}\). Volume: \(\int e^{-\frac{(x-\mu)^2}{2\sigma^2}}dx=\sqrt{2 \pi \sigma^2}\). (Gaussian)
    • \(g_\lambda (x) = e^{-\lambda x}\). Volume: \(\int^{+ \infty}_0 e^{- \lambda x} dx=\frac{1}{\lambda}\). (Exponential)
    • \(g_\theta (x) = h(x) \exp \{ \theta \cdot T(x)\}\). Volume: \(\exp \{ A (\theta)\}\) where \(A(\theta)=\log \int h(x) \exp \{\theta \cdot T(x) \} dx\). (Exponential family)
      • Normal, Poisson, exponential, Bernoulli
      • beta, gamma, Dirichlet, Wishart, etc.

Likelihood based learning

More complex models can be obtained by combining these building blocks.

  1. Autoregressive: Products of normalized objects \(p_\theta (\bf{x}) p_{\theta^{'}} (\bf{y})\): \[\int_x \int_y p_\theta (\bf{x}) p_{\theta^{'}} (\bf{y}) d \bf{x} d\bf{y} = \int_x p_\theta (\bf{x}) \underbrace{\int_y p_{\theta^{'}} (\bf{y}) d \bf{y} }_{=1} d \bf{x}=\int_x p_{\theta} (\bf{x}) d\bf{x} = 1\]

  2. Latent variables: Mixtures of normalized objects \(\alpha p_\theta (\bf{x}) + (1-\alpha) p_{\theta^{'}} (\bf{x})\) \[\int_x \alpha p_{\theta} (\bf{x}) + (1- \alpha) p_{\theta^{'}} (\bf{x}) d \bf{x} = \alpha + (1-\alpha) = 1\]

Energy-based model

\[p_\theta(\bf{x}) = \frac{1}{\int \exp (f_\theta (\bf{x}))d \bf{x}} \exp (f_\theta (\bf{x})) = \frac{1}{Z(\theta)} \exp (f_\theta (\bf{x})) \]

The volume/normalization constant (partition function)

\[Z(\theta) = \int \exp (f_\theta (\bf{x}))d \bf{x}\]

  • \(-f_\theta(\bf{x})\) is called the energy
  • Intuitively, configurations \(\bf{x}\) with low energy (high \(f_\theta(\bf{x})\)) are more likely.

Pros:

  • Very flexible model architectures
  • Stable training
  • Relatively high sample quality
  • Flexible composition

Cons:

  • Sampling from \(p_\theta (\bf{x})\) is hard
  • Evaluating and optimizing the likelihood \(p_\theta (\bf{x})\) is hard (learning is hard)
  • No feature learning (but latent variables can be added)

Curse of dimensionality: The fundamental issue is that computing \(Z(\theta)\) numerically (when no analytic solution is available) scales exponentially in the number of dimensions of \(\bf{x}\).

Nevertheless, some tasks do not require knowing \(Z(\theta)\). Given a trained model, many applications require relative comparisons. Hence \(Z(\theta)\) is not needed. Comparing the probability of two points is easy: \(p_\theta (\bf{x}')/p_\theta (\bf{x}) = \exp (f_\theta (\bf{x}') - f_\theta (\bf{x}))\)

Sampling from energy-based models

  • Maximum likelihood training: \(\max_\theta{f_\theta(x_{\rm{train}})−\log Z(\theta)}\).
    • Contrastive divergence: \[ \nabla_\theta f_\theta (\bf{x}_{\rm{train}}) - \nabla_\theta \log Z(\theta) \approx \nabla_\theta f_\theta (\bf{x}_{\rm{train}}) - \nabla_\theta f_\theta(\bf{x}_{\rm{sample}})\] where \(\bf{x}_{\rm{sample}} \backsim p_\theta (\bf{x})\).
  • Unadjusted Langevin MCMC
    • \(\bf{x}^0 \backsim \pi (\bf{x})\)
    • Repeat for \(t=0,1,2,\cdots,T-1\)
      • \(\bf{z}^t \backsim \mathscr{N} (0,I)\)
      • \(\bf{x}^{t+1} = \bf{x}^t + \epsilon \nabla_x \log p_\theta (\bf{x})|_{\bf{x}=\bf{x}^t} + \sqrt{2 \epsilon} \bf{z}^t\)
  • Properties:
    • The distribution of \(\bf{x}^T\) converges to \(p_\theta(\bf{x})\) as \(T \rightarrow \infty\) and \(\epsilon \rightarrow 0\).
    • \(\nabla_x \log p_\theta (\bf{x}) = \nabla_x f_\theta (\bf{x})\) for continuous energy-based models.
    • Convergence slows down as dimensionality grows.
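
A minimal PyTorch sketch of the unadjusted Langevin sampler above; the energy network, the step size \(\epsilon\), and the number of steps \(T\) are illustrative placeholders (the network is untrained, the point is only the update rule).

```python
import torch

# placeholder EBM: f_theta maps R^2 -> R, and the energy is -f_theta(x)
f_theta = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

def langevin_sample(n_samples=128, T=200, eps=1e-2):
    x = torch.randn(n_samples, 2)              # x^0 ~ pi(x), here a standard Gaussian
    for _ in range(T):
        x = x.detach().requires_grad_(True)
        # grad_x log p_theta(x) = grad_x f_theta(x): the partition function drops out
        grad = torch.autograd.grad(f_theta(x).sum(), x)[0]
        x = x + eps * grad + (2 * eps) ** 0.5 * torch.randn_like(x)
    return x.detach()

samples = langevin_sample()
```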

Training without sampling

Score Function

\[s_\theta (\bf{x}) := \nabla_x \log p_\theta (\bf{x}) = \nabla_x f_\theta (\bf{x}) - \underbrace{\nabla_x \log Z(\theta)}_{=0} = \nabla_x f_\theta (\bf{x}) \]

\(s_\theta (\bf{x})\) is independent of the partition function \(Z(\theta)\).

Score Matching

Minimizing the Fisher divergence between \(p_{data}(\bf{x})\) and the EBM \(p_\theta(\bf{x})\)

\[\frac{1}{2} E_{x \backsim p_{\rm{data}}} [|| \nabla_x \log p_{\rm{data}}(\bf{x}) - s_\theta (\bf{x}) ||^2_2] \\ = \frac{1}{2} E_{x \backsim p_{\rm{data}}} [|| \nabla_x \log p_{\rm{data}}(\bf{x}) - \nabla_x f_\theta (\bf{x}) ||^2_2]\]

  1. Sample a mini-batch of datapoints \(\{\bf{x}_1, \bf{x}_2,\cdots, \bf{x}_n \} \backsim p_{\rm{data}} (\bf{x})\).
  2. Estimate the score matching loss with the empirical mean \[\frac{1}{n} \sum^n_{i=1} [\frac{1}{2} || \nabla_x \log p_\theta (\bf{x}_i) ||^2_2 + \rm{tr} (\nabla^2_x \log p_\theta (\bf{x}_i)) ] \\ = \frac{1}{n} \sum^n_{i=1} [\frac{1}{2} || \nabla_x f_\theta (\bf{x}_i) ||^2_2 + \rm{tr} (\nabla^2_x f_\theta (\bf{x}_i)) ]\]
  3. Stochastic gradient descent.
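
A minimal PyTorch sketch of this score matching objective for a small-dimensional EBM, computing the trace term exactly with one backward pass per dimension; the energy network and the mini-batch are placeholders, and for high-dimensional data one would replace the exact trace with a stochastic estimator.

```python
import torch

# placeholder EBM: f_theta maps R^2 -> R; energy is -f_theta(x)
f_theta = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))

def score_matching_loss(x):
    """(1/n) sum_i [ 0.5 ||grad_x f_theta(x_i)||^2 + tr(grad_x^2 f_theta(x_i)) ]."""
    x = x.detach().requires_grad_(True)
    # s_theta(x) = grad_x f_theta(x)
    score = torch.autograd.grad(f_theta(x).sum(), x, create_graph=True)[0]
    sq_norm = 0.5 * (score ** 2).sum(dim=1)
    diag = []                                   # exact trace: one backward pass per dimension
    for i in range(x.shape[1]):
        d_score_i = torch.autograd.grad(score[:, i].sum(), x, create_graph=True)[0]
        diag.append(d_score_i[:, i])
    trace = torch.stack(diag, dim=1).sum(dim=1)
    return (sq_norm + trace).mean()

x_batch = torch.randn(64, 2)    # stand-in for a mini-batch from p_data
loss = score_matching_loss(x_batch)
loss.backward()                 # gradients w.r.t. theta, ready for stochastic gradient descent
```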

Summary

  • Autoregressive models. \(p_\theta(x_1, x_2,\cdots, x_n) = \prod^n_{i=1} p_\theta(x_i | x_{<i})\)
  • Normalizing flow models. \(p_\theta(\bf{x}) = p(\bf{z})| \det J_{f_θ} (\bf{x})|\), where \(\bf{z} = f_θ(\bf{x})\)
  • Variational autoencoders: \(p_\theta(\bf{x}) = \int p(\bf{z}) p_{\theta} (\bf{x}|\bf{z}) d\bf{z}\).
  • Energy-based models: \(p_\theta (\bf{x}) = \frac{e^{f_\theta (\bf{x})}}{Z(\theta)}\)
  • Pros:
    • Maximum likelihood training
    • Principled model comparison via likelihoods
  • Cons:
    • Model architectures are restricted.
    • Special architectures or surrogate losses to deal with intractable partition functions
  • Generative Adversarial Networks (GANs).
    • \(\min_\theta \max_\phi E_{x \backsim p_{\rm{data}}} [\log D_\phi (x)] + E_{z \backsim p(z)} [\log (1 - D_\phi (G_\theta (z)))]\)
  • Pros:
    • Two sample tests. Can optimize f-divergences and the Wasserstein distance.
    • Very flexible model architectures.
    • Samples typically have better quality
  • Cons:
    • Require adversarial training. Training instability and mode collapse.
    • No principled way to compare different models
    • No principled termination criteria for training
