Flow-based generative model
A flow-based generative model is a generative model used in machine learning that explicitly models a probability distribution by leveraging normalizing flows,[1][2][3] a statistical method that uses the change-of-variables law of probability to transform a simple distribution into a complex one.
The direct modeling of likelihood provides many advantages. For example, the negative log-likelihood can be directly computed and minimized as the loss function. Additionally, novel samples can be generated by sampling from the initial distribution, and applying the flow transformation.
In contrast, many alternative generative modeling methods such as variational autoencoder (VAE) and generative adversarial network do not explicitly represent the likelihood function.
Method
Let $z_0$ be a (possibly multivariate) random variable with distribution $p_0(z_0)$.

For $i = 1, \dots, K$, let $z_i = f_i(z_{i-1})$ be a sequence of random variables transformed from $z_0$. The functions $f_1, \dots, f_K$ should be invertible, i.e. the inverse function $f_i^{-1}$ exists. The final output $z_K$ models the target distribution.

The log likelihood of $z_K$ is (see derivation below):

$$\log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{\partial f_i(z_{i-1})}{\partial z_{i-1}} \right|$$
Learning probability distributions by differentiating such log-Jacobians originated in the Infomax (maximum-likelihood) approach to ICA,[4] which forms a single-layer ($K = 1$) flow-based model. Relatedly, a single-layer precursor of conditional generative flows appeared in Roth & Baram (1996).[5]
To efficiently compute the log likelihood, the functions should be easily invertible, and the determinants of their Jacobians should be simple to compute. In practice, the functions are modeled using deep neural networks, and are trained to minimize the negative log-likelihood of data samples from the target distribution. These architectures are usually designed such that only the forward pass of the neural network is required in both the inverse and the Jacobian determinant calculations. Examples of such architectures include NICE,[6] RealNVP,[7] and Glow.[8]
Derivation of log likelihood
Consider $z_1 = f_1(z_0)$ and $z_0$. Note that $z_0 = f_1^{-1}(z_1)$.

By the change of variable formula, the distribution of $z_1$ is:

$$p_1(z_1) = p_0(z_0) \left| \det \frac{\partial f_1^{-1}(z_1)}{\partial z_1} \right|$$

where $\det \frac{\partial f_1^{-1}(z_1)}{\partial z_1}$ is the determinant of the Jacobian matrix of $f_1^{-1}$.

By the inverse function theorem:

$$p_1(z_1) = p_0(z_0) \left| \det \left( \frac{\partial f_1(z_0)}{\partial z_0} \right)^{-1} \right|$$

By the identity $\det(A^{-1}) = \det(A)^{-1}$ (where $A$ is an invertible matrix), we have:

$$p_1(z_1) = p_0(z_0) \left| \det \frac{\partial f_1(z_0)}{\partial z_0} \right|^{-1}$$

The log likelihood is thus:

$$\log p_1(z_1) = \log p_0(z_0) - \log \left| \det \frac{\partial f_1(z_0)}{\partial z_0} \right|$$

In general, the above applies to any $z_i$ and $z_{i-1}$. Since $\log p_i(z_i)$ is equal to $\log p_{i-1}(z_{i-1})$ subtracted by a non-recursive term, we can infer by induction that:

$$\log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{\partial f_i(z_{i-1})}{\partial z_{i-1}} \right|$$
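This recursion can be made concrete with a short numerical sketch. The example below is a toy illustration (not code from any cited implementation): it composes two hypothetical invertible maps with analytically known Jacobian determinants and accumulates the log-determinant terms exactly as in the formula above.

```python
import torch

# Two assumed toy transformations with known Jacobian determinants.
f1 = lambda z: 2.0 * z + 1.0            # affine map: log|det J| = d * log 2
f2 = torch.tanh                         # elementwise: log|det J| = sum log(1 - tanh(z)^2)

z0 = torch.randn(4, 3)                  # samples from the base distribution p0 = N(0, I)
z1 = f1(z0)
z2 = f2(z1)

log_p0 = torch.distributions.Normal(0.0, 1.0).log_prob(z0).sum(-1)
log_det1 = z0.shape[-1] * torch.log(torch.tensor(2.0))
log_det2 = torch.log(1.0 - torch.tanh(z1) ** 2).sum(-1)

# log p2(z2) = log p0(z0) - sum_i log|det df_i/dz_{i-1}|
log_p2 = log_p0 - log_det1 - log_det2
```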
Training method
As is generally done when training a deep learning model, the goal with normalizing flows is to minimize the Kullback–Leibler divergence between the model's likelihood and the target distribution to be estimated. Denoting $p_\theta(x)$ the model's likelihood and $p^*(x)$ the target distribution to learn, the (forward) KL-divergence is:

$$D_{KL}\big[p^{*}(x)\,\big\|\,p_{\theta}(x)\big] = -\mathbb{E}_{p^{*}(x)}\big[\log p_{\theta}(x)\big] + \mathbb{E}_{p^{*}(x)}\big[\log p^{*}(x)\big]$$

The second term on the right-hand side of the equation corresponds to the entropy of the target distribution and is independent of the parameter $\theta$ we want the model to learn, which only leaves the expectation of the negative log-likelihood to minimize under the target distribution. This intractable term can be approximated with a Monte Carlo method by importance sampling. Indeed, if we have a dataset $\{x_i\}_{i=1}^{N}$ of samples each independently drawn from the target distribution $p^*(x)$, then this term can be estimated as:

$$-\hat{\mathbb{E}}_{p^{*}(x)}\big[\log p_{\theta}(x)\big] = -\frac{1}{N}\sum_{i=1}^{N}\log p_{\theta}(x_i)$$

Therefore, the learning objective

$$\arg\min_{\theta}\; D_{KL}\big[p^{*}(x)\,\big\|\,p_{\theta}(x)\big]$$

is replaced by

$$\arg\max_{\theta}\; \sum_{i=1}^{N}\log p_{\theta}(x_i)$$
In other words, minimizing the Kullback–Leibler divergence between the model's likelihood and the target distribution is equivalent to maximizing the model likelihood under observed samples of the target distribution.[9]
A pseudocode for training normalizing flows is as follows:[10]
- INPUT. dataset $x_{1:N}$, normalizing flow model $f_\theta(\cdot)$ with base distribution $p_0$.
- SOLVE. $\hat{\theta} = \arg\max_\theta \sum_{j=1}^{N} \ln p_\theta(x_j)$ by gradient descent
- RETURN. $\hat{\theta}$
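The pseudocode above can be fleshed out as follows. This is a minimal PyTorch sketch under simplifying assumptions: the flow is a single elementwise affine layer with a standard normal base distribution, and the class and variable names (AffineFlow, data) are illustrative rather than taken from any referenced implementation.

```python
import torch

class AffineFlow(torch.nn.Module):
    """Toy flow: z = (x - b) * exp(-s), so log|det dz/dx| = -sum(s)."""
    def __init__(self, dim):
        super().__init__()
        self.s = torch.nn.Parameter(torch.zeros(dim))   # log-scale
        self.b = torch.nn.Parameter(torch.zeros(dim))   # shift

    def log_prob(self, x):
        z = (x - self.b) * torch.exp(-self.s)
        log_base = -0.5 * (z ** 2 + torch.log(torch.tensor(2 * torch.pi))).sum(-1)
        log_det = -self.s.sum()                         # log|det dz/dx|
        return log_base + log_det

    def sample(self, n):
        z = torch.randn(n, self.s.numel())
        return z * torch.exp(self.s) + self.b           # forward map f(z)

data = torch.randn(1024, 2) * torch.tensor([2.0, 0.5]) + torch.tensor([1.0, -3.0])
model = AffineFlow(dim=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(500):
    opt.zero_grad()
    loss = -model.log_prob(data).mean()                 # negative log-likelihood
    loss.backward()
    opt.step()
```

Real models replace the single affine layer with compositions of the coupling or autoregressive layers described below; only the forward pass plus the accumulated log-determinants is needed for training.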
Variants
Planar Flow
The earliest example.[11] Fix some activation function $h$, and let $\theta = (u, w, b)$ with the appropriate dimensions, then

$$x = f_\theta(z) = z + u\, h(\langle w, z\rangle + b)$$

The inverse $f_\theta^{-1}$ has no closed-form solution in general.

The Jacobian determinant is $\left|\det\big(I + h'(\langle w, z\rangle + b)\, u w^T\big)\right| = \left|1 + h'(\langle w, z\rangle + b)\,\langle u, w\rangle\right|$.

For it to be invertible everywhere, the Jacobian determinant must be nonzero everywhere. For example, $h = \tanh$ and $\langle u, w\rangle > -1$ satisfy the requirement.
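A minimal sketch of a single planar-flow layer, assuming $h = \tanh$ (so that $h' = 1 - \tanh^2$); the parameter names u, w, b mirror the description above and are not tied to any particular library.

```python
import torch

def planar_flow(x, u, w, b):
    """Planar flow f(x) = x + u * tanh(<w, x> + b) and its log|det Jacobian|.
    x: (batch, d); u, w: (d,); b: scalar tensor."""
    lin = x @ w + b                              # (batch,)
    y = x + u * torch.tanh(lin).unsqueeze(-1)    # (batch, d)
    h_prime = 1.0 - torch.tanh(lin) ** 2         # derivative of tanh
    det = 1.0 + h_prime * (u @ w)                # 1 + h'(<w,x>+b) <u, w>
    return y, torch.log(det.abs())
```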
Nonlinear Independent Components Estimation (NICE)
Let $x, z \in \mathbb{R}^{2n}$ be even-dimensional, and split them in the middle.[6] Then the normalizing flow functions are

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = f_\theta(z) = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} + \begin{bmatrix} 0 \\ m_\theta(z_1) \end{bmatrix}$$

where $m_\theta$ is any neural network with weights $\theta$.

$f_\theta^{-1}$ is just $z_1 = x_1,\ z_2 = x_2 - m_\theta(x_1)$, and the Jacobian determinant is just 1; that is, the flow is volume-preserving.

When $n = 1$, this is seen as a curvy shearing along the $x_2$ direction.
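A sketch of one additive coupling layer in PyTorch; a hypothetical two-layer MLP stands in for the coupling network $m_\theta$. The inverse needs only a forward pass of the same network, and the log-Jacobian-determinant is identically zero.

```python
import torch

class AdditiveCoupling(torch.nn.Module):
    """NICE-style additive coupling: (x1, x2) -> (x1, x2 + m(x1)).
    Volume-preserving: log|det J| = 0."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        half = dim // 2
        self.m = torch.nn.Sequential(
            torch.nn.Linear(half, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, half))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([x1, x2 + self.m(x1)], dim=-1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        return torch.cat([y1, y2 - self.m(y1)], dim=-1)
```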
Real Non-Volume Preserving (Real NVP)
The Real Non-Volume Preserving model generalizes the NICE model by:[7]

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = f_\theta(z) = \begin{bmatrix} z_1 \\ e^{s_\theta(z_1)} \odot z_2 + m_\theta(z_1) \end{bmatrix}$$

Its inverse is $z_1 = x_1,\ z_2 = e^{-s_\theta(x_1)} \odot \big(x_2 - m_\theta(x_1)\big)$, and its Jacobian determinant is $\prod_i e^{s_\theta(z_1)_i}$. The NICE model is recovered by setting $s_\theta = 0$. Since the Real NVP map keeps the first and second halves of the vector separate, it is usually required to add a permutation $(x_1, x_2) \mapsto (x_2, x_1)$ after every Real NVP layer.
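An affine coupling layer can be sketched the same way; a single assumed network outputs both the scale $s_\theta$ and translation $m_\theta$, and the forward pass also returns the log-Jacobian-determinant $\sum_i s_\theta(z_1)_i$. Forcing the scale output to zero recovers the NICE layer above.

```python
import torch

class AffineCoupling(torch.nn.Module):
    """Real NVP affine coupling: y1 = x1, y2 = exp(s(x1)) * x2 + t(x1).
    Returns (y, log|det J|); the inverse reuses the same conditioner network."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        half = dim // 2
        self.net = torch.nn.Sequential(
            torch.nn.Linear(half, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 2 * half))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self.net(x1).chunk(2, dim=-1)
        y2 = x2 * torch.exp(s) + t
        return torch.cat([x1, y2], dim=-1), s.sum(-1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        s, t = self.net(y1).chunk(2, dim=-1)
        x2 = (y2 - t) * torch.exp(-s)
        return torch.cat([y1, x2], dim=-1)
```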
Generative Flow (Glow)
In the generative flow model,[8] each layer has 3 parts:

- channel-wise affine transform $y_{cij} = s_c\,(x_{cij} + b_c)$, with Jacobian determinant $\prod_c s_c^{HW}$.
- invertible 1x1 convolution $z_{cij} = \sum_{c'} K_{cc'}\, y_{c'ij}$, with Jacobian determinant $\det(K)^{HW}$. Here $K$ is any invertible matrix.
- Real NVP, with Jacobian as described in Real NVP.
The idea of using the invertible 1x1 convolution is to mix all channels in general, instead of merely permuting the first and second half, as in Real NVP.
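The channel-mixing step can be sketched as an einsum with a learned invertible matrix; the log-Jacobian-determinant is $HW \log|\det K|$, as stated above. The function below is an illustration, not the reference Glow implementation (which parametrizes $K$ via an LU decomposition for cheaper determinants).

```python
import torch

def invertible_1x1_conv(x, K):
    """Glow-style invertible 1x1 convolution on a (batch, C, H, W) tensor.
    Mixes channels with an invertible C x C matrix K; log|det J| = H*W*log|det K|."""
    b, c, h, w = x.shape
    y = torch.einsum('ij,bjhw->bihw', K, x)
    logdet = h * w * torch.slogdet(K).logabsdet
    return y, logdet
```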
Masked Autoregressive Flow (MAF)
An autoregressive model of a distribution on $\mathbb{R}^n$ is defined as the following stochastic process:[12]

$$x_1 \sim N(\mu_1, \sigma_1^2), \qquad x_{i+1} \sim N\!\big(\mu_{i+1}(x_{1:i}),\ \sigma_{i+1}(x_{1:i})^2\big) \quad \text{for } i = 1, \dots, n-1$$

where $\mu_i : \mathbb{R}^{i-1} \to \mathbb{R}$ and $\sigma_i : \mathbb{R}^{i-1} \to (0, \infty)$ are fixed functions that define the autoregressive model.

By the reparameterization trick, the autoregressive model is generalized to a normalizing flow:

$$x_1 = \mu_1 + \sigma_1 z_1, \qquad x_{i+1} = \mu_{i+1}(x_{1:i}) + \sigma_{i+1}(x_{1:i})\, z_{i+1}$$

The autoregressive model is recovered by setting $z \sim N(0, I_n)$.
The forward mapping is slow (because it's sequential), but the backward mapping is fast (because it's parallel).
The Jacobian matrix is lower-triangular, so its determinant is $\sigma_1\, \sigma_2(x_1) \cdots \sigma_n(x_{1:n-1})$.
Reversing the two maps and of MAF results in Inverse Autoregressive Flow (IAF), which has fast forward mapping and slow backward mapping.[13]
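The two directions can be sketched as follows, with mu and sigma standing in for masked autoregressive conditioner networks (e.g. MADE) whose output at position $i$ depends only on $x_{1:i-1}$; these names are illustrative assumptions, not a specific library API. Swapping which direction is parallel gives IAF.

```python
import torch

def maf_backward(x, mu, sigma):
    """Density-evaluation direction of MAF: recover noise z from data x in parallel.
    mu, sigma: masked conditioners; their i-th output may depend on x[..., :i] only."""
    m, s = mu(x), sigma(x)
    z = (x - m) * torch.exp(-s)
    log_det = -s.sum(-1)              # lower-triangular Jacobian of the x -> z map
    return z, log_det

def maf_forward(z, mu, sigma, dim):
    """Sampling direction: sequential, one dimension at a time."""
    x = torch.zeros_like(z)
    for i in range(dim):
        m, s = mu(x), sigma(x)        # conditioners at position i only see x[..., :i]
        x[..., i] = z[..., i] * torch.exp(s[..., i]) + m[..., i]
    return x
```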
Continuous Normalizing Flow (CNF)
Instead of constructing a flow by function composition, another approach is to formulate the flow as a continuous-time dynamic.[14][15] Let $z_0$ be the latent variable with distribution $p(z_0)$. Map this latent variable to data space with the following flow function:

$$x = F(z_0) = z_T = z_0 + \int_0^T f(z_t, t)\, dt$$

where $f$ is an arbitrary function and can be modeled with e.g. neural networks.

The inverse function is then naturally:[14]

$$z_0 = F^{-1}(x) = x + \int_T^0 f(z_t, t)\, dt = x - \int_0^T f(z_t, t)\, dt$$

And the log-likelihood of $x$ can be found as:[14]

$$\log p(x) = \log p(z_0) - \int_0^T \operatorname{tr}\!\left[\frac{\partial f}{\partial z_t}\right] dt$$
Since the trace depends only on the diagonal of the Jacobian $\partial_{z_t} f$, this allows a "free-form" Jacobian.[16] Here, "free-form" means that there is no restriction on the Jacobian's form. It is contrasted with previous discrete models of normalizing flow, where the Jacobian is carefully designed to be only upper- or lower-triangular, so that the Jacobian determinant can be evaluated efficiently.
The trace can be estimated by "Hutchinson's trick":[17][18]
Given any matrix $W \in \mathbb{R}^{n \times n}$, and any random vector $u \in \mathbb{R}^n$ with $E[uu^T] = I$, we have $E[u^T W u] = \operatorname{tr}(W)$. (Proof: expand the expectation directly.)

Usually, the random vector is sampled from $N(0, I)$ (normal distribution) or $\{\pm 1\}^n$ (Rademacher distribution).
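A sketch of the estimator with Rademacher probe vectors, using a vector-Jacobian product so that the full Jacobian is never formed; here f is any differentiable vector-valued callable (an assumption for illustration).

```python
import torch

def hutchinson_trace(f, x, n_samples=8):
    """Stochastic estimate of tr(df/dx) at x via E[eps^T J eps] with Rademacher eps.
    f: maps (batch, d) -> (batch, d); x: (batch, d)."""
    trace = torch.zeros(x.shape[0])
    for _ in range(n_samples):
        eps = (torch.randint(0, 2, x.shape) * 2 - 1).to(x.dtype)   # +/-1 entries
        x_ = x.detach().requires_grad_(True)
        y = f(x_)
        # grad with grad_outputs=eps gives the vector-Jacobian product eps^T J
        vjp = torch.autograd.grad(y, x_, grad_outputs=eps)[0]
        trace += (vjp * eps).sum(-1)                                # eps^T J eps
    return trace / n_samples
```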
When $f$ is implemented as a neural network, neural ODE methods[19] are needed. Indeed, CNF was first proposed in the same paper that proposed neural ODE.
There are two main deficiencies of CNF. One is that a continuous flow must be a homeomorphism, thus preserving orientation and ambient isotopy (for example, it is impossible to flip a left hand into a right hand by continuously deforming space, and it is impossible to turn a sphere inside out or to undo a knot); the other is that the learned flow $f$ might be ill-behaved, due to degeneracy (that is, there are an infinite number of possible $f$ that all solve the same problem).
By adding extra dimensions, the CNF gains enough freedom to reverse orientation and go beyond ambient isotopy (just like how one can pick up a polygon from a desk and flip it around in 3-space, or unknot a knot in 4-space), yielding the "augmented neural ODE".[20]
Any homeomorphism of $\mathbb{R}^n$ can be approximated by a neural ODE operating on $\mathbb{R}^{2n+1}$, proved by combining the Whitney embedding theorem for manifolds and the universal approximation theorem for neural networks.[21]
To regularize the flow $f$, one can impose regularization losses. The paper [17] proposed the following regularization loss based on optimal transport theory:

$$\lambda_K \int_0^T \left\| f(z_t, t) \right\|^2 dt \;+\; \lambda_J \int_0^T \left\| \nabla_z f(z_t, t) \right\|_F^2\, dt$$

where $\lambda_K, \lambda_J > 0$ are hyperparameters. The first term punishes the model for oscillating the flow field over time, and the second term punishes it for oscillating the flow field over space. Both terms together guide the model into a flow that is smooth (not "bumpy") over space and time.
Flows on manifolds
When a probabilistic flow transforms a distribution on an $m$-dimensional smooth manifold $\mathcal{M}$ embedded in $\mathbb{R}^n$, where $m < n$, and where the transformation is specified as a function, $f$, the scaling factor between the source and transformed PDFs is not given by the naive computation of the determinant of the Jacobian (which is zero), but instead by the determinant(s) of one or more suitably defined matrices. This section is an interpretation of the tutorial in the appendix of Sorrenson et al. (2023),[22] where the more general case of non-isometrically embedded Riemannian manifolds is also treated. Here we restrict attention to isometrically embedded manifolds.

As running examples of manifolds with smooth, isometric embedding in $\mathbb{R}^n$ we shall use:
- The unit hypersphere: $\mathbb{S}^{n-1} = \{x \in \mathbb{R}^n : \|x\| = 1\}$, where flows can be used to generalize e.g. von Mises–Fisher or uniform spherical distributions.
- The simplex interior: $\Delta^{n-1} = \{x \in \mathbb{R}^n : x_i > 0,\ \sum_i x_i = 1\}$, where $n$-way categorical distributions live; and where flows can be used to generalize e.g. Dirichlet, or uniform simplex distributions.
As a first example of a spherical manifold flow transform, consider the normalized linear transform, which radially projects onto the unit sphere the output of an invertible linear transform, parametrized by the invertible matrix $M$:

$$f_M(x) = \frac{Mx}{\|Mx\|}$$

In full Euclidean space, $f_M$ is not invertible, but if we restrict the domain and co-domain to the unit sphere, then $f_M$ is invertible (more specifically it is a bijection, a homeomorphism and a diffeomorphism), with inverse $f_M^{-1} = f_{M^{-1}}$. The Jacobian of $f_M$, at $x \in \mathbb{S}^{n-1}$, is

$$F_x = \frac{\partial f_M(x)}{\partial x} = \frac{1}{\|Mx\|}\big(I - f_M(x) f_M(x)^T\big) M,$$

which has rank $n-1$ and determinant of zero; while, as explained in the subsection below, the factor $R(x)$ relating source and transformed densities is: $R(x) = \dfrac{|\det M|}{\|Mx\|^n}$.
Differential volume ratio
For $m < n$, let $\mathcal{M}$ be an $m$-dimensional manifold with a smooth, isometric embedding into $\mathbb{R}^n$. Let $f$ be a smooth flow transform with range restricted to $\mathcal{M}$. Let $x \in \mathcal{M}$ be sampled from a distribution with density $p_X(x)$. Let $y = f(x)$, with resultant (pushforward) density $p_Y(y)$. Let $X$ be a small, convex region containing $x$ and let $Y = f(X)$ be its image, which contains $y$; then by conservation of probability mass:

$$p_X(x)\operatorname{vol}(X) \approx p_Y(y)\operatorname{vol}(Y)$$

where volume (for very small regions) is given by Lebesgue measure in $m$-dimensional tangent space. By making the regions infinitesimally small, the factor relating the two densities is the ratio of volumes, $R(x) = \operatorname{vol}(Y)/\operatorname{vol}(X)$, which we term the differential volume ratio.

To obtain concrete formulas for volume on the $m$-dimensional manifold, we construct $X$ by mapping an $m$-dimensional rectangle in (local) coordinate space to the manifold via a smooth embedding function: $e : \mathbb{R}^m \to \mathcal{M}$. At very small scale, the embedding function becomes essentially linear, so that $X$ is a parallelotope (the multidimensional generalization of a parallelogram). Similarly, the flow transform, $f$, becomes linear, so that the image, $Y$, is also a parallelotope. In $\mathbb{R}^m$, we can represent an $m$-dimensional parallelotope with an $m \times m$ matrix whose column vectors are a set of edges (meeting at a common vertex) that span the parallelotope. The volume is given by the absolute value of the determinant of this matrix. If more generally (as is the case here), an $m$-dimensional parallelotope is embedded in $\mathbb{R}^n$, it can be represented with a (tall) $n \times m$ matrix, say $E$. Denoting the parallelotope as $P(E)$, its volume is then given by the square root of the Gram determinant:

$$\operatorname{vol}\big(P(E)\big) = \sqrt{\det(E^T E)}$$
In the sections below, we show various ways to use this volume formula to derive the differential volume ratio.
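As a small numerical illustration of the Gram-determinant formula (an illustration only, with a made-up embedding), the area of a unit square embedded isometrically in $\mathbb{R}^3$ is recovered as $\sqrt{\det(E^T E)} = 1$:

```python
import numpy as np

def parallelotope_volume(E):
    """Volume of the m-dimensional parallelotope spanned by the columns of the
    tall n x m matrix E: sqrt(det(E^T E))."""
    return np.sqrt(np.linalg.det(E.T @ E))

# Unit square lying in the xy-plane inside R^3.
E = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
print(parallelotope_volume(E))   # 1.0
```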
Simplex flow
As a first example, we develop expressions for the differential volume ratio of a simplex flow, $y = f(x)$, where $x, y \in \Delta^{n-1}$. Define the embedding function:

$$e(\tilde{x}) = \begin{bmatrix} \tilde{x} \\ 1 - \mathbf{1}^T \tilde{x} \end{bmatrix}$$

which maps a conveniently chosen, $(n-1)$-dimensional representation, $\tilde{x}$, to the embedded manifold. The Jacobian is $E = \dfrac{\partial e(\tilde{x})}{\partial \tilde{x}} = \begin{bmatrix} I_{n-1} \\ -\mathbf{1}^T \end{bmatrix}$. To define $X$, the differential volume element at the transformation input ($x = e(\tilde{x})$), we start with a rectangle in $\tilde{x}$-space, having (signed) differential side-lengths $d\tilde{x}_1, \dots, d\tilde{x}_{n-1}$, from which we form the square diagonal matrix $D = \operatorname{diag}(d\tilde{x}_1, \dots, d\tilde{x}_{n-1})$, the columns of which span the rectangle. At very small scale, we get $X \approx P(ED)$, with:

$$\operatorname{vol}(X) = \sqrt{\det(D E^T E D)} = \sqrt{n}\, \prod_{i=1}^{n-1} \left|d\tilde{x}_i\right|$$

To understand the geometric interpretation of the factor $\sqrt{n}$, consider the 1-simplex: the segment from $(1,0)$ to $(0,1)$ has length $\sqrt{2}$, while its coordinate representation ranges over an interval of length 1.

The differential volume element at the transformation output ($y = f(x)$) is the parallelotope, $Y \approx P(F_x E D)$, where $F_x = \dfrac{\partial f(x)}{\partial x}$ is the Jacobian of $f$ at $x$. Its volume is:

$$\operatorname{vol}(Y) = \sqrt{\det(D E^T F_x^T F_x E D)} = |\det D|\, \sqrt{\det(E^T F_x^T F_x E)}$$

so that the factor $|\det D|$ cancels in the volume ratio, which can now already be numerically evaluated. It can however be rewritten in a sometimes more convenient form by also introducing the representation function, $r(x) = \tilde{x}$, which simply extracts the first $n-1$ components. The Jacobian is $\dfrac{\partial r(x)}{\partial x} = [\, I_{n-1} \;\; \mathbf{0} \,]$. Observe that, since $f = e \circ \tilde{f} \circ r$ on the simplex, where $\tilde{f} = r \circ f \circ e$ is the flow expressed in the $(n-1)$-dimensional representation, the chain rule for function composition gives: $F_x E = E\, \tilde{F}_{\tilde{x}}$, with $\tilde{F}_{\tilde{x}} = \dfrac{\partial \tilde{f}(\tilde{x})}{\partial \tilde{x}}$. By plugging this expansion into the above Gram determinant and then refactoring it as a product of determinants of square matrices, we can extract the factor $\sqrt{n}$, which now also cancels in the ratio, which finally simplifies to the determinant of the Jacobian of the "sandwiched" flow transformation, $\tilde{f} = r \circ f \circ e$:

$$R(x) = \frac{\operatorname{vol}(Y)}{\operatorname{vol}(X)} = \left| \det \tilde{F}_{\tilde{x}} \right|$$

which, with $\tilde{y} = \tilde{f}(\tilde{x})$, can be used to derive the pushforward density after a change of variables, $y = f(x)$:

$$p_Y(y) = \frac{p_X(x)}{\left| \det \tilde{F}_{\tilde{x}} \right|}$$

This formula is valid only because the simplex is flat and the Jacobian, $E$, is constant. The more general case for curved manifolds is discussed below, after we present two concrete examples of simplex flow transforms.
Simplex calibration transform
A calibration transform, $y = f_{a,b}(x)$, which is sometimes used in machine learning for post-processing of the (class posterior) outputs of a probabilistic $n$-class classifier,[23][24] uses the softmax function to renormalize categorical distributions after scaling and translation of the input distributions in log-probability space. For $x, y \in \Delta^{n-1}$ and with parameters $a \in \mathbb{R}$ and $b \in \mathbb{R}^n$, the transform can be specified as:

$$y = f_{a,b}(x) = \operatorname{softmax}(a \log x + b)$$

where the log is applied elementwise. After some algebra the differential volume ratio can be expressed as:

$$R(x) = a^{n-1} \prod_{i=1}^{n} \frac{y_i}{x_i}$$

- This result can also be obtained by factoring the density of the SGB distribution,[25] which is obtained by sending Dirichlet variates through $f_{a,b}$.

While calibration transforms are most often trained as discriminative models, the reinterpretation here as a probabilistic flow allows also the design of generative calibration models based on this transform. When used for calibration, the restriction $a > 0$ can be imposed to prevent direction reversal in log-probability space. With the additional restriction $b = \mathbf{0}$, this transform (with discriminative training) is known in machine learning as temperature scaling.
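A sketch of the calibration transform as read from the formula above, with an assumed scalar scale a and offset vector b (setting b = 0 gives temperature scaling with temperature 1/a):

```python
import torch

def calibrate(p, a, b):
    """Simplex calibration transform y = softmax(a * log(p) + b).
    p: points on the simplex, shape (batch, n); a: scalar > 0; b: (n,) offset."""
    return torch.softmax(a * torch.log(p) + b, dim=-1)

# Example: temperature scaling of a categorical posterior with T = 2 (a = 1/T, b = 0).
p = torch.tensor([[0.7, 0.2, 0.1]])
y = calibrate(p, a=0.5, b=torch.zeros(3))
```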
Generalized calibration transform
The above calibration transform can be generalized to $y = f_{A,b}(x)$, with parameters $b \in \mathbb{R}^n$ and invertible $A \in \mathbb{R}^{n \times n}$:[26]

$$y = f_{A,b}(x) = \operatorname{softmax}(A \log x + b)$$

where the condition that $A$ has $\mathbf{1}$ (the all-ones vector) as an eigenvector ensures invertibility by sidestepping the information loss due to the invariance: $\operatorname{softmax}(z + c\mathbf{1}) = \operatorname{softmax}(z)$. Note in particular that $A = aI$ is the only allowed diagonal parametrization, in which case we recover $f_{a,b}$, while (for $n > 2$) generalization is possible with non-diagonal matrices. The inverse is:

$$x = f_{A,b}^{-1}(y) = \operatorname{softmax}\!\big(A^{-1}(\log y - b)\big)$$

The differential volume ratio is:

$$R(x) = \frac{|\det A|}{|\lambda|} \prod_{i=1}^{n} \frac{y_i}{x_i}$$

where $\lambda$ is the eigenvalue of $A$ associated with the eigenvector $\mathbf{1}$.

If $f_{A,b}$ is to be used as a calibration transform, further constraints could be imposed, for example that $A$ be positive definite, which avoids direction reversals. (This is one possible generalization of $a > 0$ in the matrix parameter.)

For $n = 2$ and $A$ positive definite, $f_{A,b}$ and $f_{a,b}$ are equivalent in the sense that in both cases, the map from $\log\frac{x_1}{x_2}$ to $\log\frac{y_1}{y_2}$ is a straight line, the (positive) slope and offset of which are functions of the transform parameters. For $n > 2$, $f_{A,b}$ does generalize $f_{a,b}$.

It must however be noted that chaining multiple flow transformations does not give a further generalization, because:

$$f_{A',b'} \circ f_{A,b} = f_{A'A,\; A'b + b'}$$

In fact, the set of transformations $\{f_{A,b}\}$ forms a group under function composition. The set of transformations $\{f_{a,b}\}$ forms a subgroup.

Also see: Dirichlet calibration,[27] which generalizes $f_{A,b}$ by not placing any restriction on the matrix, $A$, so that invertibility is not guaranteed. While Dirichlet calibration is trained as a discriminative model, $f_{A,b}$ can also be trained as part of a generative calibration model.
Differential volume ratio for curved manifolds
Consider a flow, $y = f(x)$, on a curved manifold, for example the unit sphere $\mathbb{S}^{n-1}$, which we equip with the embedding function, $e$, that maps a set of angular spherical coordinates to points on the sphere. The Jacobian of $e$ is non-constant and we have to evaluate it at both input ($\tilde{x}$, where $x = e(\tilde{x})$) and output ($\tilde{y}$, where $y = e(\tilde{y})$). The same applies to $r$, the representation function that recovers spherical coordinates from points on the sphere, for which we need the Jacobian at the output ($y$). The differential volume ratio now generalizes to:

$$R(x) = \frac{\sqrt{\det\!\big(E_{\tilde{y}}^T E_{\tilde{y}}\big)}}{\sqrt{\det\!\big(E_{\tilde{x}}^T E_{\tilde{x}}\big)}}\; \left| \det \frac{\partial \tilde{f}(\tilde{x})}{\partial \tilde{x}} \right|, \qquad \tilde{f} = r \circ f \circ e$$

where $E_{\tilde{x}}$ and $E_{\tilde{y}}$ are the Jacobians of $e$ at $\tilde{x}$ and $\tilde{y}$ respectively.

For geometric insight, consider $\mathbb{S}^2$, where the spherical coordinates are co-latitude, $\theta$, and longitude, $\phi$. At $\tilde{x} = (\theta, \phi)$, we get $\sqrt{\det\big(E_{\tilde{x}}^T E_{\tilde{x}}\big)} = \sin\theta$, which gives the radius of the circle at that latitude (compare e.g. polar circle to equator). The differential volume (surface area on the sphere) is: $\sin\theta\, |d\theta\, d\phi|$.

The above derivation for $R(x)$ is fragile in the sense that when using fixed functions $e, r$, there may be places where they are not well-defined, for example at the poles of the 2-sphere where longitude is arbitrary. This problem is sidestepped (using standard manifold machinery) by generalizing to local coordinates (charts), where in the vicinities of $x$ and $y$, we map from local $m$-dimensional coordinates to $\mathcal{M}$ and back using the respective function pairs $(e_x, r_x)$ and $(e_y, r_y)$. We continue to use the same notation for the Jacobians of these functions ($E_{\tilde{x}}$, $E_{\tilde{y}}$, etc.), so that the above formula for $R(x)$ remains valid.
We can, however, choose our local coordinate system in a way that simplifies the expression for $R(x)$ and indeed also its practical implementation.[22] Let $\pi$ be a smooth idempotent projection ($\pi \circ \pi = \pi$) from the projectible set, $\mathcal{P} \subseteq \mathbb{R}^n$, onto the embedded manifold. For example:

- The positive orthant of $\mathbb{R}^n$ is projected onto the simplex as: $\pi(z) = \dfrac{z}{\mathbf{1}^T z}$
- Non-zero vectors in $\mathbb{R}^n$ are projected onto the unit sphere as: $\pi(z) = \dfrac{z}{\|z\|}$

For every $x \in \mathcal{M}$, we require of $\pi$ that its Jacobian, $\Pi_x = \dfrac{\partial \pi(z)}{\partial z}\Big|_{z=x}$, has rank $m$ (the manifold dimension), in which case $\Pi_x$ is an idempotent linear projection onto the local tangent space (orthogonal for the unit sphere: $\Pi_x = I - xx^T$; oblique for the simplex: $\Pi_x = I - x\mathbf{1}^T$). The columns of $\Pi_x$ span the $m$-dimensional tangent space at $x$. We use the notation, $T_x$, for any $n \times m$ matrix with orthonormal columns ($T_x^T T_x = I_m$) that span the local tangent space. Also note: $\Pi_x T_x = T_x$. We can now choose our local coordinate embedding function, $e_x$:

$$e_x(\tilde{x}) = \pi(x + T_x \tilde{x}), \qquad \text{with Jacobian } E_{\tilde{x}} = \Pi_x T_x = T_x \text{ at } \tilde{x} = \mathbf{0}$$

Since the Jacobian $E_{\tilde{x}} = T_x$ is injective (full rank: $m$), a local (not necessarily unique) left inverse, say $r_x$, exists such that $r_x(e_x(\tilde{x})) = \tilde{x}$ and $\dfrac{\partial r_x}{\partial y}\, E_{\tilde{x}} = I_m$. In practice we do not need the left inverse function itself, but we do need its Jacobian, for which the above equation does not give a unique solution. We can however enforce a unique solution for the Jacobian by choosing the left inverse as $r_x(y) = T_x^T(y - x)$, with Jacobian:

$$\frac{\partial r_x(y)}{\partial y} = T_x^T$$

We can now finally plug $E_{\tilde{x}} = T_x$ and $\dfrac{\partial r_y(y)}{\partial y} = T_y^T$ into our previous expression for $R(x)$, the differential volume ratio, which, because of the orthonormal Jacobians, simplifies to:[28]

$$R(x) = \left| \det\!\big( T_y^T F_x T_x \big) \right|$$
Practical implementation
For learning the parameters of a manifold flow transformation, we need access to the differential volume ratio, $R(x)$, or at least to its gradient w.r.t. the parameters. Moreover, for some inference tasks, we need access to $R(x)$ itself. Practical solutions include:

- Sorrenson et al. (2023)[22] give a solution for computationally efficient stochastic approximation of the parameter gradient of $\log R(x)$.
- For some hand-designed flow transforms, $R(x)$ can be analytically derived in closed form, for example the above-mentioned simplex calibration transforms. Further examples are given below in the section on simple spherical flows.
- On a software platform equipped with linear algebra and automatic differentiation, $R(x)$ can be automatically evaluated, given access to only $f$ and the projection $\pi$.[29] But this is expensive for high-dimensional data, with at least $O(n^3)$ computational cost. Even then, the slow automatic solution can be invaluable as a tool for numerically verifying hand-designed closed-form solutions.
Simple spherical flows
In the machine learning literature, various complex spherical flows formed by deep neural network architectures may be found.[22] In contrast, this section compiles from the statistics literature the details of three very simple spherical flow transforms, with simple closed-form expressions for inverses and differential volume ratios. These flows can be used individually, or chained, to generalize distributions on the unit sphere, $\mathbb{S}^{n-1}$. All three flows are compositions of an invertible affine transform in $\mathbb{R}^n$, followed by radial projection back onto the sphere. The flavours we consider for the affine transform are: pure translation, pure linear and general affine. To make these flows fully functional for learning, inference and sampling, the tasks are:
- To derive the inverse transform, with suitable restrictions on the parameters to ensure invertibility.
- To derive in simple closed form the differential volume ratio, $R(x)$.
An interesting property of these simple spherical flows is that they don't make use of any non-linearities apart from the radial projection. Even the simplest of them, the normalized translation flow, can be chained to form perhaps surprisingly flexible distributions.
Normalized translation flow
The normalized translation flow, $y = f_v(x)$, with parameter $v \in \mathbb{R}^n$, is given by:

$$y = f_v(x) = \frac{x + v}{\|x + v\|}$$

The inverse function may be derived by considering, for $r = \|x + v\|$: $x = ry - v$, and then using $\|x\| = 1$ to get a quadratic equation to recover $r$, which gives:

$$x = f_v^{-1}(y) = ry - v, \qquad r = y^T v + \sqrt{(y^T v)^2 + 1 - \|v\|^2}$$

from which we see that we need $\|v\| < 1$ to keep $r$ real and positive for all $y$. The differential volume ratio is given (without derivation) by Boulerice & Ducharme (1994) as:[30]

$$R(x) = \frac{1 + x^T v}{\|x + v\|^n}$$
This can indeed be verified analytically:
- By a laborious manipulation of $\left| \det\!\big( T_y^T F_x T_x \big) \right|$.
- By setting $A = I$ and $b = v$ in the differential volume ratio of the normalized affine flow, which is given below.

Finally, it is worth noting that $f_v$ and $f_v^{-1}$ do not have the same functional form.
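A sketch of the normalized translation flow and the inverse derived above (the parameter v with ‖v‖ < 1 is an assumed example value); the round-trip check is illustrative only.

```python
import torch

def norm_translate(x, v):
    """Normalized translation flow on the unit sphere: f_v(x) = (x + v)/||x + v||.
    Requires ||v|| < 1 so that the flow is invertible."""
    z = x + v
    return z / z.norm(dim=-1, keepdim=True)

def norm_translate_inv(y, v):
    """Inverse: x = r*y - v, with r the positive root of ||r*y - v|| = 1."""
    yv = (y * v).sum(-1, keepdim=True)
    r = yv + torch.sqrt(yv ** 2 + 1.0 - (v * v).sum())
    return r * y - v

# Round-trip check on random points of the 2-sphere.
x = torch.nn.functional.normalize(torch.randn(5, 3), dim=-1)
v = torch.tensor([0.3, -0.2, 0.1])
assert torch.allclose(norm_translate_inv(norm_translate(x, v), v), x, atol=1e-5)
```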
Normalized linear flow
The normalized linear flow, $y = f_M(x)$, where the parameter $M$ is an invertible $n \times n$ matrix, is given by:

$$y = f_M(x) = \frac{Mx}{\|Mx\|}$$

The differential volume ratio is:

$$R(x) = \frac{|\det M|}{\|Mx\|^n}$$
This result can be derived indirectly via the Angular central Gaussian distribution (ACG),[31] which can be obtained via normalized linear transform of either Gaussian, or uniform spherical variates. The first relationship can be used to derive the ACG density by a marginalization integral over the radius; after which the second relationship can be used to factor out the differential volume ratio. For details, see ACG distribution.
Normalized affine flow
The normalized affine flow, $y = f_{A,b}(x)$, with parameters $b \in \mathbb{R}^n$ and $A \in \mathbb{R}^{n \times n}$, invertible, is given by:

$$y = f_{A,b}(x) = \frac{Ax + b}{\|Ax + b\|}$$

The inverse function, derived in a similar way to the normalized translation inverse, is:

$$x = f_{A,b}^{-1}(y) = A^{-1}(ry - b), \qquad r = \frac{\bar{y}^T \bar{b} + \sqrt{(\bar{y}^T \bar{b})^2 + \|\bar{y}\|^2\,\big(1 - \|\bar{b}\|^2\big)}}{\|\bar{y}\|^2}$$

where $\bar{y} = A^{-1} y$ and $\bar{b} = A^{-1} b$. The differential volume ratio is:

$$R(x) = \frac{\left| \det(A + b x^T) \right|}{\|Ax + b\|^n} = \frac{|\det A|\, \left| 1 + x^T A^{-1} b \right|}{\|Ax + b\|^n}$$
The final RHS numerator was expanded from $\det(A + b x^T)$ by the matrix determinant lemma. Recalling $y = \frac{Ax + b}{\|Ax + b\|}$, the equality between $\left| \det\!\big( T_y^T F_x T_x \big) \right|$ and the above expression holds because not only:

$$(A + b x^T)\, x = Ax + b = \|Ax + b\|\; y$$

but also, by orthogonality of $x$ to the local tangent space:

$$T_y^T (A + b x^T)\, T_x = T_y^T A\, T_x = \|Ax + b\|\; T_y^T F_x T_x$$

where $F_x$ is the Jacobian of $f_{A,b}$, differentiated w.r.t. its input, but not also w.r.t. its parameters.
Downsides
Despite normalizing flows' success in estimating high-dimensional densities, some downsides still exist in their designs. First of all, the latent space onto which input data is mapped is not a lower-dimensional space; therefore, flow-based models do not allow for compression of data by default and require a lot of computation. However, it is still possible to perform image compression with them.[32]
Flow-based models are also notorious for failing to estimate the likelihood of out-of-distribution samples (i.e. samples that were not drawn from the same distribution as the training set).[33] Some hypotheses have been formulated to explain this phenomenon, among them the typical set hypothesis,[34] estimation issues when training models,[35] and fundamental issues due to the entropy of the data distributions.[36]
One of the most interesting properties of normalizing flows is the invertibility of their learned bijective map. This property is ensured by constraints in the design of the models (cf. RealNVP, Glow) which guarantee theoretical invertibility. The integrity of the inverse is important for the applicability of the change-of-variables theorem, the computation of the Jacobian of the map, and sampling with the model. However, in practice this invertibility can be violated, and the inverse map can explode because of numerical imprecision.[37]
Applications
Flow-based generative models have been applied to a variety of modeling tasks, including:

- Audio generation[38]
- Image generation[8]
- Molecular graph generation[39]
- Point-cloud modeling[40]
- Video generation[41]
- Lossy image compression[32]
- Anomaly detection[42]
References
[edit]- ^ Tabak, Esteban G.; Vanden-Eijnden, Eric (2010). "Density estimation by dual ascent of the log-likelihood". Communications in Mathematical Sciences. 8 (1): 217–233. doi:10.4310/CMS.2010.v8.n1.a11.
- ^ Tabak, Esteban G.; Turner, Cristina V. (2012). "A family of nonparametric density estimation algorithms". Communications on Pure and Applied Mathematics. 66 (2): 145–164. doi:10.1002/cpa.21423. hdl:11336/8930. S2CID 17820269.
- ^ Papamakarios, George; Nalisnick, Eric; Jimenez Rezende, Danilo; Mohamed, Shakir; Lakshminarayanan, Balaji (2021). "Normalizing flows for probabilistic modeling and inference". Journal of Machine Learning Research. 22 (1): 2617–2680. arXiv:1912.02762.
- ^ Bell, A. J.; Sejnowski, T. J. (1995). "An information-maximization approach to blind separation and blind deconvolution". Neural Computation. 7 (6): 1129–1159. doi:10.1162/neco.1995.7.6.1129.
- ^ Roth, Z.; Baram, Y. (1996). "Multidimensional density shaping by sigmoids". IEEE Transactions on Neural Networks. 7 (5): 1291–1298. doi:10.1109/72.536322.
- ^ a b Dinh, Laurent; Krueger, David; Bengio, Yoshua (2014). "NICE: Non-linear Independent Components Estimation". arXiv:1410.8516 [cs.LG].
- ^ a b Dinh, Laurent; Sohl-Dickstein, Jascha; Bengio, Samy (2016). "Density estimation using Real NVP". arXiv:1605.08803 [cs.LG].
- ^ a b c Kingma, Diederik P.; Dhariwal, Prafulla (2018). "Glow: Generative Flow with Invertible 1x1 Convolutions". arXiv:1807.03039 [stat.ML].
- ^ Papamakarios, George; Nalisnick, Eric; Rezende, Danilo Jimenez; Mohamed, Shakir; Lakshminarayanan, Balaji (March 2021). "Normalizing Flows for Probabilistic Modeling and Inference". Journal of Machine Learning Research. 22 (57): 1–64. arXiv:1912.02762.
- ^ Kobyzev, Ivan; Prince, Simon J.D.; Brubaker, Marcus A. (November 2021). "Normalizing Flows: An Introduction and Review of Current Methods". IEEE Transactions on Pattern Analysis and Machine Intelligence. 43 (11): 3964–3979. arXiv:1908.09257. Bibcode:2021ITPAM..43.3964K. doi:10.1109/TPAMI.2020.2992934. ISSN 1939-3539. PMID 32396070. S2CID 208910764.
- ^ Danilo Jimenez Rezende; Mohamed, Shakir (2015). "Variational Inference with Normalizing Flows". arXiv:1505.05770 [stat.ML].
- ^ Papamakarios, George; Pavlakou, Theo; Murray, Iain (2017). "Masked Autoregressive Flow for Density Estimation". Advances in Neural Information Processing Systems. 30. Curran Associates, Inc. arXiv:1705.07057.
- ^ Kingma, Durk P; Salimans, Tim; Jozefowicz, Rafal; Chen, Xi; Sutskever, Ilya; Welling, Max (2016). "Improved Variational Inference with Inverse Autoregressive Flow". Advances in Neural Information Processing Systems. 29. Curran Associates, Inc. arXiv:1606.04934.
- ^ a b c Grathwohl, Will; Chen, Ricky T. Q.; Bettencourt, Jesse; Sutskever, Ilya; Duvenaud, David (2018). "FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models". arXiv:1810.01367 [cs.LG].
- ^ Lipman, Yaron; Chen, Ricky T. Q.; Ben-Hamu, Heli; Nickel, Maximilian; Le, Matt (2022-10-01). "Flow Matching for Generative Modeling". arXiv:2210.02747 [cs.LG].
- ^ Grathwohl, Will; Chen, Ricky T. Q.; Bettencourt, Jesse; Sutskever, Ilya; Duvenaud, David (2018-10-22). "FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models". arXiv:1810.01367 [cs.LG].
- ^ a b Finlay, Chris; Jacobsen, Joern-Henrik; Nurbekyan, Levon; Oberman, Adam (2020-11-21). "How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization". International Conference on Machine Learning. PMLR: 3154–3164. arXiv:2002.02798.
- ^ Hutchinson, M.F. (January 1989). "A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines". Communications in Statistics - Simulation and Computation. 18 (3): 1059–1076. doi:10.1080/03610918908812806. ISSN 0361-0918.
- ^ Chen, Ricky T. Q.; Rubanova, Yulia; Bettencourt, Jesse; Duvenaud, David K. (2018). "Neural Ordinary Differential Equations" (PDF). In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; Garnett, R. (eds.). Advances in Neural Information Processing Systems. Vol. 31. Curran Associates, Inc. arXiv:1806.07366.
- ^ Dupont, Emilien; Doucet, Arnaud; Teh, Yee Whye (2019). "Augmented Neural ODEs". Advances in Neural Information Processing Systems. 32. Curran Associates, Inc.
- ^ Zhang, Han; Gao, Xi; Unterman, Jacob; Arodz, Tom (2019-07-30). "Approximation Capabilities of Neural ODEs and Invertible Residual Networks". arXiv:1907.12998 [cs.LG].
- ^ a b c d Sorrenson, Peter; Draxler, Felix; Rousselot, Armand; Hummerich, Sander; Köthe, Ullrich (2023). "Learning Distributions on Manifolds with Free-Form Flows". arXiv:2312.09852 [cs.LG].
- ^ Brümmer, Niko; van Leeuwen, D. A. (2006). "On calibration of language recognition scores". Proceedings of IEEE Odyssey: The Speaker and Language Recognition Workshop. San Juan, Puerto Rico. pp. 1–8. doi:10.1109/ODYSSEY.2006.248106.
- ^ Ferrer, Luciana; Ramos, Daniel (2024). "Evaluating Posterior Probabilities: Decision Theory, Proper Scoring Rules, and Calibration". arXiv:2408.02841 [stat.ML].
- ^ Graf, Monique (2019). "The Simplicial Generalized Beta distribution - R-package SGB and applications". Libra. Retrieved 26 May 2025.
- ^ Brümmer, Niko (18 October 2010). Measuring, refining and calibrating speaker and language information extracted from speech (PhD thesis). Stellenbosch, South Africa: Department of Electrical & Electronic Engineering, University of Stellenbosch.
- ^ Meelis Kull, Miquel Perelló‑Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, Peter A. Flach (28 October 2019). "Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration". arXiv:1910.12656 [cs.LG].
- ^ The tangent matrices are not unique: if $T_x$ has orthonormal columns and $Q$ is an orthogonal $m \times m$ matrix, then $T_x Q$ also has orthonormal columns that span the same subspace; it is easy to verify that $\left|\det\!\big(T_y^T F_x T_x\big)\right|$ is invariant to such transformations of the tangent representatives.
- ^ With PyTorch:
```python
from torch.linalg import qr
from torch.func import jacrev

def logRf(pi, m, f, x):
    y = f(x)
    Fx, PI = jacrev(f)(x), jacrev(pi)
    Tx, Ty = [qr(PI(z)).Q[:, :m] for z in (x, y)]
    return (Ty.T @ Fx @ Tx).slogdet().logabsdet
```
- ^ Boulerice, Bernard; Ducharme, Gilles R. (1994). "Decentered Directional Data". Annals of the Institute of Statistical Mathematics. 46 (3): 573–586. doi:10.1007/BF00773518.
- ^ Tyler, David E (1987). "Statistical analysis for the angular central Gaussian distribution on the sphere". Biometrika. 74 (3): 579–589. doi:10.2307/2336697. JSTOR 2336697.
- ^ a b Helminger, Leonhard; Djelouah, Abdelaziz; Gross, Markus; Schroers, Christopher (2020). "Lossy Image Compression with Normalizing Flows". arXiv:2008.10486 [cs.CV].
- ^ Nalisnick, Eric; Matsukawa, Akihiro; Teh, Yee Whye; Gorur, Dilan; Lakshminarayanan, Balaji (2018). "Do Deep Generative Models Know What They Don't Know?". arXiv:1810.09136v3 [stat.ML].
- ^ Nalisnick, Eric; Matsukawa, Akihiro; Teh, Yee Whye; Lakshminarayanan, Balaji (2019). "Detecting Out-of-Distribution Inputs to Deep Generative Models Using Typicality". arXiv:1906.02994 [stat.ML].
- ^ Zhang, Lily; Goldstein, Mark; Ranganath, Rajesh (2021). "Understanding Failures in Out-of-Distribution Detection with Deep Generative Models". Proceedings of Machine Learning Research. 139: 12427–12436. PMC 9295254. PMID 35860036.
- ^ Caterini, Anthony L.; Loaiza-Ganem, Gabriel (2022). "Entropic Issues in Likelihood-Based OOD Detection". pp. 21–26. arXiv:2109.10794 [stat.ML].
- ^ Behrmann, Jens; Vicol, Paul; Wang, Kuan-Chieh; Grosse, Roger; Jacobsen, Jörn-Henrik (2020). "Understanding and Mitigating Exploding Inverses in Invertible Neural Networks". arXiv:2006.09347 [cs.LG].
- ^ Ping, Wei; Peng, Kainan; Gorur, Dilan; Lakshminarayanan, Balaji (2019). "WaveFlow: A Compact Flow-based Model for Raw Audio". arXiv:1912.01219 [cs.SD].
- ^ Shi, Chence; Xu, Minkai; Zhu, Zhaocheng; Zhang, Weinan; Zhang, Ming; Tang, Jian (2020). "GraphAF: A Flow-based Autoregressive Model for Molecular Graph Generation". arXiv:2001.09382 [cs.LG].
- ^ Yang, Guandao; Huang, Xun; Hao, Zekun; Liu, Ming-Yu; Belongie, Serge; Hariharan, Bharath (2019). "PointFlow: 3D Point Cloud Generation with Continuous Normalizing Flows". arXiv:1906.12320 [cs.CV].
- ^ Kumar, Manoj; Babaeizadeh, Mohammad; Erhan, Dumitru; Finn, Chelsea; Levine, Sergey; Dinh, Laurent; Kingma, Durk (2019). "VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation". arXiv:1903.01434 [cs.CV].
- ^ Rudolph, Marco; Wandt, Bastian; Rosenhahn, Bodo (2021). "Same Same But DifferNet: Semi-Supervised Defect Detection with Normalizing Flows". arXiv:2008.12577 [cs.CV].
External links
[edit]Flow-based generative model
View on GrokipediaOverview
Definition and Principles
Flow-based generative models constitute a class of probabilistic models that approximate complex data distributions by applying a sequence of invertible transformations, or bijections, to a simple base probability distribution, such as a multivariate Gaussian. These transformations, collectively termed normalizing flows, enable the construction of flexible density estimators capable of capturing intricate multimodal structures in high-dimensional data. The approach originates from the idea of deforming a known simple density into a target distribution while preserving invertibility to facilitate both generation and evaluation tasks.[5] At the core of these models is the principle of exact likelihood computation, achieved through the change of variables formula applied to the sequence of bijections. This allows for the direct evaluation of the probability density function for any data point by accounting for the volume change induced by each transformation via the absolute value of its Jacobian determinant. Consequently, flow-based models support maximum likelihood training, where the objective is to maximize the log-likelihood of observed data without relying on variational bounds, adversarial objectives, or Monte Carlo approximations, leading to stable and interpretable optimization.[5][8] The operational pipeline of a flow-based model begins with sampling latent variables from the base distribution, which are then sequentially transformed through the invertible functions to yield generated data samples. For inference, the process reverses: observed data is mapped backward through the inverse transformations to the base space, enabling precise density estimation. This bidirectional invertibility distinguishes flows from other generative paradigms.[5] Flow-based models mitigate prominent shortcomings in alternative generative frameworks. In contrast to variational autoencoders, which frequently yield blurry reconstructions owing to their reliance on reconstruction losses like mean squared error that average over latent uncertainties, flows produce sharp samples by directly modeling the data density. Unlike generative adversarial networks, which forgo explicit likelihoods and are prone to mode collapse and training instabilities from the minimax game, flows offer tractable densities and consistent maximum likelihood optimization for reliable performance.[8]Historical Development
The roots of flow-based generative models trace back to invertible transformations in statistics and physics. In statistics, the Box-Cox transformation, introduced in 1964, provided a foundational method for stabilizing variance and achieving normality through power-law mappings that preserve order and invertibility. In physics, Hamiltonian flows, originating from classical mechanics in the 19th century, describe volume-preserving dynamics in phase space, influencing later concepts of continuous-time transformations in machine learning. These ideas laid the groundwork for bijective mappings in probabilistic modeling, though their formal integration into machine learning occurred later. The formalization of normalizing flows in machine learning began in the mid-2010s as an enhancement to variational inference and density estimation. Dinh et al. (2014) proposed NICE, the first deep generative model using additive coupling layers for non-linear independent component estimation, allowing efficient training on high-dimensional data like images without restrictive assumptions.[2] This was followed by Rezende and Mohamed (2015), who introduced planar and radial flows to construct flexible approximate posteriors by composing invertible transformations with a base distribution, enabling exact likelihood evaluation and addressing limitations in traditional variational methods.[5] Key milestones continued with discrete flow architectures. Building on NICE, Dinh et al. (2016) developed Real NVP, which extended coupling to affine transformations with scale and translation, improving expressivity and scaling to color image datasets through autoregressive partitioning.[9] Kingma and Dhariwal (2018) advanced the field with Glow, incorporating invertible 1x1 convolutions and activation normalization, achieving state-of-the-art likelihood on datasets like CIFAR-10 (3.35 bits per dimension) and enabling high-fidelity image synthesis.[4] Post-2018 developments emphasized continuous and flexible flows. Grathwohl et al. (2019) introduced FFJORD, a continuous normalizing flow based on neural ordinary differential equations, using Hutchinson's trace estimator for scalable log-density computation and demonstrating competitive performance on tabular and image data.[10] Concurrently, Durkan et al. (2019) proposed neural spline flows, leveraging monotonic rational-quadratic splines for highly expressive bijections, outperforming prior models in density estimation on datasets like POWER and CelebA.[11] A significant connection between normalizing flows and diffusion models was established in 2021 by Song et al., who proposed the probability flow ODE in their work on score-based generative models. This formulation transforms the stochastic differential equations of diffusion processes into deterministic ordinary differential equations, akin to continuous normalizing flows, enabling efficient and deterministic sampling.[12] By 2022-2025, flows integrated with diffusion models gained prominence; Liu et al. (2022) developed rectified flows, which straighten probability paths for faster sampling (reducing steps from thousands to 1-3 in CIFAR-10 generation) and domain transfer.[6] Lipman et al. 
(2022) introduced flow matching, a simulation-free training paradigm for continuous flows that regresses vector fields, enabling efficient generative modeling and surpassing diffusion models in sample quality on ImageNet.[13] In 2025, advancements included CrystalFlow for generating crystalline materials structures and improvements in adaptive flow matching algorithms presented at ICML, extending applications to materials science and enhancing training efficiency.[14][15] These hybrids have extended flows to scientific simulations, such as molecular dynamics and climate modeling, by providing exact likelihoods for uncertainty quantification.[6]Mathematical Foundations
Change of Variables Theorem
The change of variables theorem provides the foundational mechanism for transforming probability densities in normalizing flows, enabling the expression of a complex target density in terms of a simple base density through an invertible mapping.[5] Specifically, for a bijective function and random variables and , the density of is given by where denotes the Jacobian matrix of evaluated at , and is its determinant.[5] Equivalently, expressing the target density in terms of the base density where , it becomes [16] The derivation begins with the requirement that the transformation preserves total probability mass, starting from the Dirac delta formulation for the density of the transformed variable. Consider the probability element: for an infinitesimal volume around , the probability must equal , where and . Thus, yielding after cancellation, with the absolute value ensuring non-negativity of densities.[16] This follows from integrating over the Dirac delta , where , and change of variables in the integral leads to the Jacobian factor as the volume scaling term.[5] Intuitively, the theorem accounts for how the invertible transformation distorts volumes in the input space: expansions (determinant >1) compress the density to maintain probability conservation, while contractions (determinant <1) expand it, inversely scaling the base density to reflect local stretching or compression.[16] In the multivariate case, the Jacobian is the matrix of partial derivatives , and its determinant quantifies the oriented volume change under the linear approximation of at . Computing directly is via methods like LU decomposition, but for high dimensions in flows, efficient structures (e.g., triangular Jacobians) allow evaluation as a product of diagonals.[16] Approximations such as the trace-log-determinant estimator, leveraging , use stochastic trace estimation (e.g., Hutchinson's estimator with random vectors) to reduce complexity in continuous-time flows, where the log-determinant integrates as .[16]Normalizing Flow Construction
Normalizing flows are constructed by composing a sequence of invertible and differentiable transformations, denoted as , where each maps an intermediate variable to . This composition transforms a sample from a simple base distribution, such as a standard Gaussian, into a data sample . The overall change of density relies on the chain rule for the Jacobian determinant, yielding the total log-determinant as the sum , which ensures computational tractability by avoiding the need to compute a single large Jacobian matrix.[17][18] Each transformation must be bijective, meaning it has an efficient exact inverse , and its Jacobian determinant must be computable in linear time, where is the data dimensionality, to enable exact likelihood evaluation without prohibitive costs. Designs achieving this often employ structures with triangular Jacobians, where the determinant simplifies to the product of diagonal elements, or other decompositions like the matrix determinant lemma for near-identity transformations. These requirements stem from the need for both forward normalization (mapping data to the base) and inverse generation (sampling from the base to data), preserving the diffeomorphic properties throughout the composition.[17][5] The term "normalizing" in normalizing flows specifically refers to the forward direction of the flow, which transforms observed data through the inverse composition to a latent variable in the base distribution, facilitating density estimation via the change of variables formula. Conversely, generation proceeds via the forward flow , starting from base samples. A simple illustrative example is the affine transformation , where is an invertible matrix and a bias vector; the Jacobian determinant is , computable directly as the product of eigenvalues or via LU decomposition, though more scalable variants restrict to diagonal or triangular forms for efficiency. This affine coupling forms the basis for more expressive flows in early models.[2][18]Model Components
Base Distribution
In flow-based generative models, the base distribution serves as the simple prior probability density from which the latent variables are drawn, providing the starting point for the invertible transformations that map it to the complex data distribution. This prior, often denoted as , is crucial for enabling efficient sampling from the model by first generating samples from the base and then applying the flow transformations forward. Its simplicity facilitates both density evaluation and the overall tractability of the model, as the log-likelihood of data points can be computed exactly via the change of variables formula once the base density is evaluated.[17] Common choices for the base distribution include the standard multivariate Gaussian, which is isotropic and centered at zero with unit variance, making it easy to sample and evaluate. For instance, in the NICE model, a factorial Gaussian or logistic distribution is used to assume independence across dimensions, with the logistic preferred for its smoother gradients during training. Similarly, RealNVP employs an isotropic unit Gaussian prior, while Glow uses a spherical multivariate Gaussian to leverage its analytical tractability. Uniform distributions on the unit hypercube are also common, particularly for bounding the support or in discrete flow variants. For multimodal data, Gaussian mixtures can extend the base to capture multiple modes more naturally from the outset.[17][19][20][21] Selection of the base distribution prioritizes distributions that allow straightforward evaluation of and efficient sampling, often assuming independence between dimensions to simplify computations like the Jacobian determinant in the flow. This independence assumption aids scalability, as it decouples the latent variables and reduces the need for complex joint evaluations. The base is transformed via a sequence of bijective flows to approximate the target data distribution.[17] A key limitation of fixed simple bases like the Gaussian is their constrained expressivity for data with heavy tails, as the light-tailed prior may require excessively complex transformations to capture outliers effectively.[17]Bijective Transformations
Bijective transformations form the backbone of flow-based generative models, enabling the mapping of a simple base distribution to a complex target distribution while preserving the ability to compute exact likelihoods. These transformations must be invertible, with both the forward and inverse mappings computable efficiently, and the determinant of the Jacobian matrix evaluated tractably to facilitate density estimation via the change of variables formula. Seminal work established that diffeomorphic functions—smooth bijections with smooth inverses—satisfy these requirements, allowing flows to model multimodal and high-dimensional data distributions.[5] Invertible transformations in normalizing flows are broadly categorized into linear (or affine), nonlinear, and structured types, each designed to balance expressiveness with computational efficiency. Linear transformations, such as affine mappings of the form where is an invertible matrix, provide a straightforward way to rescale and shift variables, but their Jacobian determinant is simply , which can be costly to compute for dense matrices unless is structured. Nonlinear transformations often apply element-wise activations combined with scaling, such as monotonic functions that ensure invertibility by restricting the output range, allowing the model to capture non-linear dependencies while maintaining bijectivity through careful parameterization. Structured transformations, including coupling and autoregressive designs, impose architectural constraints to guarantee invertibility; for instance, coupling layers partition the input and transform only a subset based on the remainder, while autoregressive layers process variables sequentially with triangular Jacobians.[17][2][9] Invertibility guarantees in these transformations distinguish between volume-preserving and non-volume-preserving mappings. Volume-preserving transformations, such as orthogonal matrices or permutations, have a Jacobian determinant of exactly 1, simplifying density computations since they do not alter the data volume. In contrast, non-volume-preserving transformations, like those involving affine scalings, allow flexible density reshaping but require explicit determinant calculation, often through diagonal or triangular forms to avoid expensive full-matrix operations. This distinction enables flows to either maintain uniformity in certain directions or introduce scaling for better expressivity.[17][9] Efficiency in bijective transformations is achieved through specialized designs that reduce the computational burden of inversion and Jacobian evaluation. For triangular matrices, common in autoregressive flows, the determinant is the product of diagonal elements, computable in linear time for -dimensional inputs; however, applying the transformation and its inverse is sequential, resulting in time complexity due to autoregressive dependencies. Coupling layers enhance parallelism by leaving half the variables unchanged, so the Jacobian determinant becomes the product of scale factors applied to the transformed subset, enabling cost for the determinant computation beyond the neural network evaluations for scale and translation functions. 
These tricks make flows scalable to high dimensions, such as images, without prohibitive overhead.[17][2][9] A representative general form for many bijective transformations, particularly in coupling layers, is the affine coupling transformation: where is the partitioned input, denotes element-wise multiplication, and and are neural networks outputting scales and translations based on the untransformed partition . The inverse is explicitly given by solving for and , with the Jacobian determinant as , ensuring tractable likelihoods when composed with a base distribution. This form, introduced in early coupling-based flows, exemplifies how neural networks parameterize invertible mappings while integrating seamlessly with the overall flow construction.[9]Training and Inference
Exact Likelihood Computation
Flow-based generative models enable exact likelihood computation through the change of variables theorem applied to a sequence of invertible transformations. Consider a dataset drawn from an unknown data distribution . The model parameterizes as the pushforward of a tractable base distribution , typically a standard Gaussian, via a bijective flow composed of layers . The density is then , where is the Jacobian matrix of the inverse flow.[5] To derive the log-likelihood recursively, start with the change of variables for a single transformation , yielding , or in log-space, Applying this iteratively from the base to the data , the full log-likelihood becomes where are intermediate latents (computed via the inverse flow from to for efficiency). This formulation ensures tractable evaluation by designing layers where the log-determinant is computed in or time per layer, avoiding full Jacobian matrices.[5] The training objective is maximum likelihood estimation, maximizing the average log-likelihood over the dataset: where parameterizes the neural networks in each . Recent advancements, such as flow matching for continuous flows, provide simulation-free training objectives to accelerate learning without numerical ODE solvers.[13] This is minimized as the negative log-likelihood using stochastic gradient descent (SGD) or variants like Adam. Gradients flow through the invertible transformations and log-determinants via automatic differentiation, enabling end-to-end optimization without approximations like variational bounds used in other generative models.[5] Model performance is often evaluated using bits per dimension (BPD), a normalized metric assessing density estimation quality: , where is the data dimensionality (e.g., image pixels). Lower BPD indicates better compressibility and fit to the data distribution, for example, the Glow model achieved 3.35 BPD on CIFAR-10, while more recent flow-based models have achieved as low as 2.56 BPD (as of 2023).[4][22]Sampling Procedures
In flow-based generative models, sampling generates new data points by starting with a draw from the base distribution and propagating it through the sequence of invertible transformations. Typically, a latent variable is sampled from a simple base distribution such as a standard multivariate Gaussian , followed by iterative application of bijective functions: for , yielding the final sample from the target data distribution .[5][17] This forward pass exploits the model's invertibility to produce exact samples without stochastic approximation.[2] The efficiency of this procedure stems from the tractable structure of the transformations, which avoids full Jacobian matrix computations. For architectures using coupling layers, each transformation operates in time, where is the data dimensionality, resulting in overall sampling complexity of for layers.[17] Certain variants, such as Glow, further enable parallelization across dimensions via affine couplings and invertible convolutions, achieving sub-second synthesis for high-resolution images on consumer GPUs.[4] A key challenge arises in deep flows with large , where the sequential nature of layer applications slows sampling proportionally to depth. Post-2020 advancements mitigate this through knowledge distillation, training compact student flows to replicate the sampling behavior of deeper teachers, thereby reducing inference time while preserving generative quality.[23] In contrast to Markov Chain Monte Carlo (MCMC) methods, which rely on iterative chains and burn-in to approximate samples from unnormalized densities, flow models support direct, deterministic one-pass generation with no convergence overhead.[17] This exact sampling complements the models' exact likelihood training, enabling unified optimization for both density estimation and generation.[5]Key Variants
Coupling Layer Flows
Coupling layer flows represent an early class of scalable normalizing flow models that enable efficient, parallelizable transformations by splitting the input into two parts and applying a bijection to only one part conditioned on the other, resulting in a triangular Jacobian matrix that allows exact and tractable determinant computation.[2] This design ensures invertibility and volume preservation or scaling while facilitating parallel computation across dimensions, contrasting with sequential dependencies in autoregressive flows.[9] The Non-linear Independent Components Estimation (NICE) model, introduced in 2014, pioneered this approach using additive coupling layers. In NICE, the input is partitioned into two halves and , with the transformation defined as: where is a function, such as a multi-layer perceptron with ReLU activations, that maps the first half to a correction for the second.[2] The Jacobian of this additive coupling is lower triangular with ones on the diagonal, yielding a determinant of 1, which makes the transformation volume-preserving and simplifies likelihood evaluation.[2] By alternating the partitioning across multiple layers, NICE achieves expressive density estimation on datasets like CIFAR-10, attaining a negative log-likelihood of 5371.78 nats.[2] Building on NICE, the Real-valued Non-Volume Preserving (Real NVP) model from 2016 extended coupling layers to affine transformations for greater expressivity. The coupling function is: where and are scale and translation functions implemented via deep convolutional networks, and denotes element-wise multiplication.[9] The Jacobian determinant is , which is efficiently computable in a single forward pass.[9] For image data, Real NVP employs masking strategies, such as checkerboard patterns that alternate transformed and frozen pixels or channels, enabling effective modeling of spatial correlations and achieving 3.49 bits per dimension on CIFAR-10.[9] Glow, proposed in 2018, further advanced coupling layer flows by integrating invertible 1×1 convolutions and a multi-scale architecture, enhancing mixing between dimensions and scalability to high resolutions. In Glow, affine couplings split along channels, with and computed using convolutional networks ending in 1×1 convolutions for efficiency; these are followed by invertible 1×1 convolutions, which apply a learned linear mixing across channels and have determinants computed via LU decomposition.[4] The multi-scale structure progressively downsamples the data through squeeze layers (reducing spatial dimensions while increasing channels) and factorizes the flow into levels, allowing coarse-to-fine generation.[4] On CIFAR-10, Glow achieves 3.35 bits per dimension, improving over Real NVP, with ablation studies showing that 1×1 convolutions outperform fixed permutations by enabling faster convergence and lower negative log-likelihood.[4]Autoregressive Flows
Autoregressive flows impose an ordering on the dimensions of the data, transforming each dimension conditioned solely on the preceding dimensions . This autoregressive structure yields a lower-triangular Jacobian matrix, whose determinant is the product of its diagonal elements, , enabling tractable exact likelihood evaluation in time.[24] The Masked Autoregressive Flow (MAF), introduced by Papamakarios et al. (2017), implements this framework using masked autoregressive networks—such as the Masked Autoencoder for Distribution Estimation (MADE)—to parameterize the scale and translation functions in component-wise affine transformations . The masking ensures that the conditioner for each dimension depends only on prior ones, allowing parallel forward passes for efficient density estimation, while the inverse transformation is computed sequentially but accelerated via caching of intermediate values.[25] MAF has demonstrated state-of-the-art performance in general-purpose density estimation benchmarks, such as tabular data and images.[25] In contrast, the Inverse Autoregressive Flow (IAF), proposed by Kingma et al. (2016), inverts the autoregressive direction by defining transformations where each output dimension conditions the input for subsequent ones, facilitating parallel sampling from the base distribution. However, this design makes density evaluation sequential, increasing computational cost compared to the forward direction.[26] IAF is particularly suited for variational inference in high-dimensional latent spaces, improving posterior approximations in models like variational autoencoders.[26] Autoregressive flows provide high expressivity for data with inherent sequential dependencies, such as time series, where the ordered transformations naturally capture temporal correlations.[24] Recent advancements address the sequential limitations of these models; for instance, block-autoregressive flows, developed by De Cao et al. (2019), partition dimensions into blocks with intra-block parallelism while enforcing inter-block autoregression, using fewer parameters than standard variants and achieving competitive density estimation on datasets like images and text.[27] Unlike coupling layer flows, which enable full parallelism without dimensional ordering, autoregressive flows prioritize precise conditional modeling at the expense of sequential computation.[24]Continuous Flows
Continuous Flows
Continuous flows extend normalizing flows by modeling the transformation as a continuous-time dynamical system, providing smoother and more expressive mappings between distributions. The core formulation defines the trajectory of a sample via the ordinary differential equation (ODE)
$$\frac{dx(t)}{dt} = f(x(t), t),$$
where $f$ is a time-dependent neural network, $x(0)$ is sampled from the base distribution (e.g., a standard Gaussian), and $x(1)$ follows the target data distribution.[10] This setup allows effectively infinite-depth transformations, in contrast with discrete flows composed of a finite number of layers.[10] The evolution of the probability density follows the continuity equation, which yields the instantaneous change-of-variables formula
$$\frac{\partial \log p(x(t))}{\partial t} = -\operatorname{Tr}\!\left(\frac{\partial f}{\partial x(t)}\right),$$
and integrating from $t = 0$ to $t = 1$ gives
$$\log p_1(x(1)) = \log p_0(x(0)) - \int_0^1 \operatorname{Tr}\!\left(\frac{\partial f}{\partial x(t)}\right) dt,$$
where the trace of the Jacobian accounts for volume changes along the flow.[10] Direct computation of this trace costs $O(D^2)$ for dimension $D$, but approximations enable scalability. Continuous normalizing flows (CNFs) build on Neural ODEs, solving the dynamics with black-box ODE integrators; FFJORD, introduced by Grathwohl et al., approximates the trace using the Hutchinson estimator
$$\operatorname{Tr}(A) = \mathbb{E}_{\varepsilon}\!\left[\varepsilon^{\top} A\, \varepsilon\right]$$
for a random vector $\varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Cov}(\varepsilon) = I$, reducing the cost to $O(D)$ per sample.[10] This unbiased estimator, often averaged over multiple Monte Carlo samples for variance reduction, supports training by maximum likelihood. FFJORD relaxes constraints on the Jacobian to allow free-form architectures, accelerating training while maintaining invertibility for efficient sampling and density evaluation, and it achieves competitive density estimation on image benchmarks such as CIFAR-10 when using multiscale architectures.[10] The ODEs in CNFs are typically solved with adaptive methods such as Dormand-Prince (dopri5), which adjust the integration step to meet a prescribed error tolerance at a memory cost independent of the effective "depth".[10] Backpropagating through the solver can incur high memory overhead, which is addressed by adjoint sensitivity methods that compute gradients without storing intermediate states, or by reversible variants that reconstruct activations on the fly.[10] Recent advances, such as simulation-free training via Flow Matching introduced in 2022, regress conditional vector fields directly, learning CNFs without simulating the ODE during optimization and enabling faster convergence and broader applicability in generative modeling.[13] Stochastic continuous flows extend the deterministic ODE to stochastic differential equations (SDEs), incorporating noise for better uncertainty quantification and robustness, as demonstrated in frameworks using stochastic interpolants to bridge distributions.[28] Further extensions include Riemannian continuous normalizing flows for data on manifolds and enhanced flow matching techniques for scalable training.[29][7]
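The Hutchinson estimator can be checked numerically. The following NumPy sketch compares the exact Jacobian trace of a toy vector field against the estimator's Monte Carlo average; the one-layer field f, the finite-difference jvp, and the probe count are illustrative stand-ins (in practice the products $Jv$ come from automatic differentiation).

import numpy as np

rng = np.random.default_rng(0)
D = 8
W = rng.normal(size=(D, D))

def f(x):
    # Toy vector field: a single tanh layer, f(x) = tanh(W x).
    return np.tanh(W @ x)

def jvp(x, v, h=1e-6):
    # Jacobian-vector product J(x) v via central finite differences
    # (a stand-in for an automatic-differentiation JVP/VJP).
    return (f(x + h * v) - f(x - h * v)) / (2 * h)

x = rng.normal(size=D)
# Analytic trace of the Jacobian for this toy field, used as reference:
# d f_i / d x_i = (1 - tanh(W x)_i^2) * W_ii.
exact_trace = np.sum((1.0 - np.tanh(W @ x) ** 2) * np.diag(W))

# Hutchinson estimator: Tr(J) = E[v^T J v] for probes v with zero mean and
# identity covariance; here Rademacher probes, averaged over many samples.
n_probes = 5000
est = np.mean([v @ jvp(x, v) for v in rng.choice([-1.0, 1.0], size=(n_probes, D))])
print(exact_trace, est)  # the estimate should be close to the exact trace

Each probe costs only one Jacobian-vector product, which is why the estimator scales linearly in the dimension rather than quadratically.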
Extensions to Manifolds
Volume Preservation on Curved Spaces
Flow-based generative models, traditionally defined on Euclidean spaces, face significant challenges when extended to non-Euclidean spaces such as the simplex or the sphere, where the underlying geometry is curved. In these settings, the transformations must account for the Riemannian volume element defined by the manifold's metric tensor in order to remain probabilistically consistent. Unlike in flat space, the intrinsic geometry of a curved manifold must be taken into account so that the probability measure transforms correctly under bijective mappings.[30]

A brief overview of Riemannian geometry is helpful for understanding these extensions. A Riemannian manifold $(M, g)$ is a smooth manifold $M$ equipped with a metric tensor $g$, a positive-definite bilinear form on each tangent space that varies smoothly across the manifold. Manifolds can be embedded in higher-dimensional Euclidean spaces, but local computations often rely on charts—diffeomorphic mappings to open subsets of $\mathbb{R}^d$—which provide coordinate representations. In local coordinates, the Riemannian volume form is $dV = \sqrt{\det g(x)}\, dx$, generalizing the Lebesgue measure.[30]

The core adaptation for volume preservation in flow models on manifolds is the differential volume ratio, which generalizes the absolute value of the Jacobian determinant in the Euclidean change-of-variables theorem. For a diffeomorphism $f$ on a Riemannian manifold $(M, g)$, with $J_f(x)$ denoting the Jacobian matrix of $f$ in local coordinates at $x$, the volume scaling factor is
$$\left|\det J_f(x)\right| \, \frac{\sqrt{\det g(f(x))}}{\sqrt{\det g(x)}}.$$
This factor adjusts the density to account for both the coordinate Jacobian and the variation of the metric tensor across points. In the Euclidean case, where $g$ is the identity matrix everywhere, it reduces to $\left|\det J_f(x)\right|$.[30]

This formula arises from the pullback of the Riemannian volume form under the diffeomorphism. The pullback $f^{*}\omega_{f(x)}$, where $\omega_{f(x)} = \sqrt{\det g(f(x))}\, dy$ is the volume form at $f(x)$, equals $\sqrt{\det g(f(x))}\,\left|\det J_f(x)\right| dx$. For probability conservation, the density $p_X$ with respect to the volume measure at $x$ relates to the density $p_Y$ at $y = f(x)$ by
$$p_X(x) = p_Y(f(x))\, \left|\det J_f(x)\right| \, \frac{\sqrt{\det g(f(x))}}{\sqrt{\det g(x)}},$$
which is exactly the volume-scaling factor above. This ensures that integrals of the density over the manifold remain invariant, preserving the total probability mass on curved domains.[30]
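Written in local coordinates, the correction is straightforward to compute. The sketch below is a minimal illustration, assuming the user supplies the diffeomorphism f, a routine f_jacobian returning its Jacobian matrix, and a metric function returning the metric tensor at a point; all names are placeholders rather than an established API.

import numpy as np

def log_volume_ratio(x, f, f_jacobian, metric):
    # log of |det J_f(x)| * sqrt(det g(f(x))) / sqrt(det g(x)),
    # i.e. the log of the differential volume ratio in local coordinates.
    _, log_abs_det_J = np.linalg.slogdet(f_jacobian(x))
    _, log_det_g_x = np.linalg.slogdet(metric(x))
    _, log_det_g_fx = np.linalg.slogdet(metric(f(x)))
    return log_abs_det_J + 0.5 * (log_det_g_fx - log_det_g_x)

# With the identity (Euclidean) metric the correction reduces to log|det J_f(x)|:
euclidean_metric = lambda x: np.eye(len(x))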
Specific Manifold Flows
Simplex flows enable the modeling of probability distributions constrained to the probability simplex, such as those following Dirichlet distributions. The seminal simplex transform proposed by Gemici et al. (2016) bijectively maps points on the simplex to an unconstrained Euclidean space using cumulative sums, so that standard normalizing flow architectures can be applied before mapping back; this ensures exact likelihood computation while preserving the manifold structure.[30]

Spherical flows address distributions on the unit hypersphere, which are common for directional data. Normalized translation and affine flows, developed by Rezende et al. (2020), project a point on the sphere into a tangent space via the logarithmic map, apply an invertible Euclidean normalizing flow in that space, and project back with the exponential map to maintain the manifold constraint. The density adjustment accounts for the Riemannian metric, ensuring tractable log-likelihoods, and these flows perform well on von Mises-Fisher distributions and related directional statistics.[31]

Practical implementations of manifold flows often use hyperspherical coordinates for spheres, where a point on $S^{d-1}$ is parameterized by angles $\theta_1, \dots, \theta_{d-1}$ as
$$x = \big(\cos\theta_1,\ \sin\theta_1\cos\theta_2,\ \dots,\ \sin\theta_1\cdots\sin\theta_{d-2}\cos\theta_{d-1},\ \sin\theta_1\cdots\sin\theta_{d-1}\big),$$
with $\theta_i \in [0, \pi]$ for $i < d-1$ and $\theta_{d-1} \in [0, 2\pi)$. Singularities at the poles are mitigated by padding (e.g., augmenting with a fixed coordinate) or by log-ratio transforms that avoid numerical issues during optimization; for simplex flows, similar padding handles the boundaries of the cumulative mappings.[31]

Applications of these flows extend to directional statistics, where spherical constructions model angular data in fields such as robotics and geophysics, outperforming traditional parametric models in flexibility and likelihood accuracy. The following pseudocode sketches a basic tangent-space flow on the sphere:
def spherical_flow(x_on_sphere, flow_net, base_point, log_map, exp_map):
    # Map the point into the tangent space at a reference point on the
    # sphere (logarithmic map); base_point could be, e.g., the north pole.
    v = log_map(base_point, x_on_sphere)          # v lies in T_base S^{d-1}
    # Apply an invertible Euclidean flow in the tangent space; flow_net is
    # assumed to return the transformed vector and its log|det Jacobian|.
    v_transformed, log_det_flow = flow_net(v)
    # Map back onto the sphere with the exponential map,
    # exp_p(v) = p cos||v|| + (v/||v||) sin||v||.
    x_new = exp_map(base_point, v_transformed)
    # Total log-density correction: the flow's log-determinant plus the
    # Riemannian volume terms of the log/exp maps (omitted in this sketch).
    log_det = log_det_flow
    return x_new, log_det
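For concreteness, the log_map and exp_map helpers assumed above can be written for the unit sphere as follows. This is a minimal NumPy sketch, valid away from antipodal points, and not code from the cited works.

import numpy as np

def exp_map(p, v, eps=1e-12):
    # Exponential map on the unit sphere: move from p along tangent vector v.
    norm_v = np.linalg.norm(v)
    if norm_v < eps:
        return p
    return np.cos(norm_v) * p + np.sin(norm_v) * (v / norm_v)

def log_map(p, x, eps=1e-12):
    # Logarithmic map on the unit sphere: tangent vector at p pointing to x.
    cos_theta = np.clip(np.dot(p, x), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < eps:
        return np.zeros_like(p)
    u = x - cos_theta * p              # component of x orthogonal to p
    return theta * u / np.linalg.norm(u)

# Round-trip check on S^2, using a fixed "north pole" as the reference point.
p = np.array([0.0, 0.0, 1.0])
x = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)
assert np.allclose(exp_map(p, log_map(p, x)), x)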
