A generative model is a machine learning model that draws samples approximately from the joint probability distribution of the data. Such joint models potentially capture all patterns in the data, which is useful in e.g. speech synthesis, text analysis, semi-supervised learning, and model-based control. A deep generative model is a generative model that learns a hierarchy of representations. Generative modeling has been a guiding principle of deep learning research.
A generator network is a generative model that samples from a parametrized probabilistic graphical model. Many generator networks are undirected graphical models, e.g. Boltzmann machines and their variants, while others are directed generative nets. A differentiable generator network is a directed generator network that maps between random variables via a network implementing a family of differentiable functions, e.g. VAEs, GANs, NADE, and flow models, where the gradients are useful for optimization and other tasks.
Evaluation metrics for generative models are often based on log-likelihood, but this can be tricky. Score matching [@Hyvärinen2005] is an alternative to maximum likelihood that minimizes the difference between the true and estimated gradients of the log-density (aka the score), $\nabla \log p(x)$. In practice it is often better to use a criterion specific to the task [@Theis2016]. When benchmarking generative models, the input to each algorithm must be exactly the same.
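For reference, the score matching objective of [@Hyvärinen2005] compares the model score with the data score; integration by parts (under mild regularity conditions) removes the unknown data score:

$$
J(\theta)
  = \tfrac{1}{2}\,\mathbb{E}_{p_{\text{data}}}\big[\|\nabla_x \log p_\theta(x) - \nabla_x \log p_{\text{data}}(x)\|^2\big]
  = \mathbb{E}_{p_{\text{data}}}\big[\operatorname{tr}\big(\nabla_x^2 \log p_\theta(x)\big) + \tfrac{1}{2}\|\nabla_x \log p_\theta(x)\|^2\big] + \text{const}.
$$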
Boltzmann machine [@Fahlman1983; @Ackley1985; @Hinton1984] is an energy-based model, i.e. the joint probability distribution is defined via an energy function: $P(x) = e^{-E(x)} / Z$, where $E(x) = - x^T U x - b^T x$ is a quadratic energy function and $Z$ is the partition function that normalizes the probability density.
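As a concrete illustration (a minimal sketch, not from the cited papers; the weights and number of units below are made up), the Boltzmann machine density can be evaluated by brute-force enumeration of binary states when the model is tiny:

```python
# Brute-force evaluation of P(x) = exp(-E(x)) / Z for a tiny Boltzmann machine
# with quadratic energy E(x) = -x^T U x - b^T x; feasible only for a few units.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 5                                   # number of binary units (illustrative)
U = np.triu(rng.normal(scale=0.1, size=(n, n)), k=1)  # weights on distinct pairs
b = rng.normal(scale=0.1, size=n)       # biases

def energy(x):
    return -(x @ U @ x) - b @ x

states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
unnormalized = np.exp([-energy(x) for x in states])
Z = unnormalized.sum()                  # partition function by enumeration
probs = unnormalized / Z
print(probs.sum())                      # sanity check: probabilities sum to 1
```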
Restricted Boltzmann machine (RBM) [@Smolensky1986] is an undirected model consisting of one layer of observable variables and one layer of latent variables; the variables within each layer are mutually independent when conditioned on the neighboring layer.
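This conditional independence makes block Gibbs sampling simple. A minimal sketch for a binary RBM (the weights, biases, and layer sizes are made up for illustration):

```python
# Block Gibbs sampling in a binary RBM: sample all hidden units given the
# visibles, then all visibles given the hiddens, and repeat.
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 3                          # visible / hidden sizes (illustrative)
W = rng.normal(scale=0.1, size=(n_v, n_h))
b = np.zeros(n_v)                        # visible biases
c = np.zeros(n_h)                        # hidden biases

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_step(v):
    h = rng.binomial(1, sigmoid(v @ W + c))       # P(h_j = 1 | v), factorized
    return rng.binomial(1, sigmoid(h @ W.T + b))  # P(v_i = 1 | h), factorized

v = rng.binomial(1, 0.5, size=n_v)       # random initial visible state
for _ in range(1000):                    # run the chain toward its stationary distribution
    v = gibbs_step(v)
print(v)
```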
Deep belief network (DBN) [@Hinton2006] is a mixed graphical model (i.e. partially directed, partially undirected) with several layers of latent variables: the connections between layers are directed downwards towards the data, except those between the top two layers, which are undirected.
Deep Boltzmann machine (DBM) [@Salakhutdinov2009] is a multi-layer restricted Boltzmann machine, and thus an undirected model with several layers of latent variables.
Auto-encoder [@LeCun1987; @Bourlard1988] $g(f(x))$ is a feedforward network consisting of an encoder followed by a decoder, trained to minimize a reconstruction loss, e.g. mean squared error; the encoder learns a representation of the input and the decoder reconstructs the input approximately. Autoencoders were originally used for dimensionality reduction or feature learning, and are now also used for generative modeling. In fact, almost all generative models with latent variables derived from the input can be viewed as autoencoders. Deep autoencoder is an auto-encoder whose encoder or decoder is multi-layer. Autoencoders are often shallow, but deep autoencoders yield much better compression than corresponding shallow or linear autoencoders [@Hinton2006]. Stochastic auto-encoder is an auto-encoder whose encoder and decoder are probabilistic. Undercomplete / overcomplete autoencoder is one whose code dimension is less / greater than its input dimension. Regularized auto-encoder uses regularization to prevent the auto-encoder from simply learning the identity function; for example, a sparse autoencoder adds a sparsity penalty to the training objective, e.g. the 1-norm of the latent variables.
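A minimal PyTorch sketch of one sparse-autoencoder training step (the layer sizes, penalty weight, and minibatch below are illustrative, not taken from the cited works):

```python
# One training step of a sparse autoencoder: MSE reconstruction loss plus an
# L1 penalty on the latent code h = f(x).
import torch
import torch.nn as nn

x_dim, h_dim, lam = 784, 64, 1e-3        # illustrative sizes and penalty weight
encoder = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
decoder = nn.Linear(h_dim, x_dim)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(32, x_dim)                # stand-in minibatch
h = encoder(x)                           # latent code f(x)
x_hat = decoder(h)                       # reconstruction g(f(x))
loss = nn.functional.mse_loss(x_hat, x) + lam * h.abs().mean()
opt.zero_grad()
loss.backward()
opt.step()
```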
Generative stochastic networks (GSNs) [@Bengio2014] are stochastic auto-encoders used for generative modeling of $p(h, x)$; the family includes the generalized denoising auto-encoder and the contractive auto-encoder. Denoising auto-encoder (DAE) [@Vincent2008] is trained to minimize a loss between the input and its reconstruction from a noise-corrupted input: $L(x, g(f(\tilde x)))$. A trained DAE approximates the expectation of the data conditioned on a corrupted data point, a weighted mean of data points near the corrupted point, equivalent to the mean-shift vector up to a scaling factor: $g(f(\tilde x)) \approx \mathbb{E}[\mathbf{x} \mid \tilde x]$. When the reconstruction distribution is Gaussian, the DAE difference vector field approximates the score of the data distribution, i.e. the gradient of the log-density [@Alain2014]: $g(f(\tilde x)) - \tilde x \approx \lambda \nabla \log p(\tilde x)$. Regularized score matching recovers a blurred version of the data distribution, but it is consistent as the sample size grows and the noise level decreases. With corruption and reconstruction both Gaussian, DAE sampling effectively alternates kernel smoothing and (approximate) mean-shift steps. Contractive auto-encoder (CAE) [@Rifai2011] adds to the training objective a term proportional to the squared Frobenius norm of the Jacobian of the latent variables: $L(x, g(f(x))) + \lambda \|\nabla_x f(x)\|_F^2$; it estimates the tangent planes of the data manifold, and the associated MCMC sampling procedure [@Rifai2012] induces a random walk (i.e. diffusion) on the manifold [@Mesnil2012]. The CAE regularization criterion becomes much more expensive in deep autoencoders. Generalized denoising auto-encoder [@Bengio2013b] samples from a Markov chain that repeats the following steps: (1) add corruption noise, $\mathbf{\tilde x} = \mathbf{x} + \mathbf{z}$; (2) estimate the denoising/reconstruction/posterior distribution parametrically, with an autoencoder trained to maximize log-likelihood via gradient methods, $\omega = g(f(\tilde x))$; (3) sample from the estimated distribution, $p(x \mid \tilde x) = p_\omega (x; g(f(\tilde x)))$. If the estimator of the conditional distribution is consistent and the Markov chain is ergodic and satisfies detailed balance, then the stationary distribution of the Markov chain is a consistent estimator of the data-generating distribution.
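A minimal sketch of this sampling chain, assuming an already-trained denoiser `denoise` that returns the mean of the reconstruction distribution, with Gaussian corruption and Gaussian reconstruction noise (the function name and noise scales are illustrative):

```python
# Generalized DAE sampling: corrupt, denoise, then sample the reconstruction.
import numpy as np

rng = np.random.default_rng(0)

def sample_chain(denoise, x0, steps=100, sigma_corrupt=0.5, sigma_recon=0.1):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x_tilde = x + rng.normal(scale=sigma_corrupt, size=x.shape)  # (1) corrupt
        mu = denoise(x_tilde)                                        # (2) estimate E[x | x~]
        x = mu + rng.normal(scale=sigma_recon, size=x.shape)         # (3) sample p(x | x~)
    return x
```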
Variational auto-encoder (VAE) [@Kingma2014a; @Rezende2014; @Razavi2019] optimizes a lower bound on the log-likelihood of the data; training and synthesis are parallelizable, but inference of latent variables is approximate, the marginal density is intractable (so training uses a surrogate, bound-based objective), and VAEs can be hard to optimize.
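A minimal PyTorch sketch of the negative lower bound (negative ELBO) for a VAE with a Gaussian encoder and the reparameterization trick (the single linear layers, sizes, and MSE/Gaussian likelihood are illustrative simplifications):

```python
# Negative ELBO = reconstruction term + KL(q(z|x) || N(0, I)).
import torch
import torch.nn as nn

x_dim, z_dim = 784, 20                   # illustrative sizes
enc = nn.Linear(x_dim, 2 * z_dim)        # outputs [mu, log-variance]
dec = nn.Linear(z_dim, x_dim)

x = torch.rand(32, x_dim)                # stand-in minibatch
mu, logvar = enc(x).chunk(2, dim=-1)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
x_hat = dec(z)

recon = nn.functional.mse_loss(x_hat, x, reduction="sum") / x.size(0)
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
neg_elbo = recon + kl                    # upper bound on -log p(x), up to constants
```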
Generative adversarial network (GAN) [@Goodfellow2014] trains a generator against an attached discriminative model; the loss is implicit (defined by the discriminator), there is no encoder, and the learned distribution may have limited support over the data distribution.
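The minimax game of [@Goodfellow2014]: the discriminator $D$ is trained to distinguish data from samples, while the generator $G$ is trained to fool it:

$$
\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big].
$$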
Likelihood-based generative models [@Kingma2018]: autoregressive models; variational auto-encoders (VAEs); generative flows.
Autoregressive models [@Hochreiter1997; @Graves2013; @VanDenOord2016; @VanDenOord2016a; @VanDenOord2016b] are reversible, but synthesis is sequential and scales with the data dimension, and the marginal distributions of the latent variables are unknown. Autoregressive networks include the neural autoregressive density estimator (NADE) [@Larochelle2011]. Generative flow models [@Dinh2014; @Dinh2016; @Kingma2018] have a tractable, reversible density, exact inference of latent variables and exact evaluation of log-likelihood, parallelizable inference and synthesis, and memory cost independent of model depth.
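For reference, the autoregressive factorization and the change-of-variables formula that make exact log-likelihood evaluation tractable (with $z = f(x)$ the latent variables and $f$ invertible):

$$
p(x) = \prod_{i=1}^{D} p\big(x_i \mid x_{<i}\big), \qquad
\log p_X(x) = \log p_Z\big(f(x)\big) + \log \left|\det \frac{\partial f(x)}{\partial x}\right|.
$$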
Generative stochastic network (GSN) generalizes the probabilistic/generative interpretation of denoising autoencoders [@Bengio2013b] to arbitrary latent variables; it is an alternative to maximum likelihood for density estimation, and samples from the learned model by running a Markov chain that iteratively adds noise and samples from the learned denoising distribution.
GSNs can also be used with missing inputs and can sample a subset of variables given the rest.
A GSN treats the data-generating distribution as the stationary distribution of a Markov chain and parametrically estimates the chain's transition distribution, which is much simpler than the data distribution: roughly unimodal and generally involving only a small move.
GSNs can be trained efficiently with back-propagated gradients, without requiring intractable quantities such as the partition function (normalization constant). Noise is added to accelerate mixing of the Markov chain.
Experiments on two image datasets.
[@Goodfellow2016]