Marginal likelihood
A marginal likelihood is a likelihood function that has been integrated over the parameter space. In Bayesian statistics, it represents the probability of generating the observed sample for all possible values of the parameters; it can be understood as the probability of the model itself and is therefore often referred to as model evidence or simply evidence.
Due to the integration over the parameter space, the marginal likelihood does not directly depend upon the parameters. If the focus is not on model comparison, the marginal likelihood is simply the normalizing constant that ensures that the posterior is a proper probability. It is related to the partition function in statistical mechanics.[1]
Concept
Given a set of independent identically distributed data points $\mathbf{X} = (x_1, \ldots, x_n)$, where $x_i \sim p(x \mid \theta)$ according to some probability distribution parameterized by $\theta$, where $\theta$ itself is a random variable described by a distribution $p(\theta \mid \alpha)$, the marginal likelihood in general asks what the probability $p(\mathbf{X} \mid \alpha)$ is, where $\theta$ has been marginalized out (integrated out):

$$p(\mathbf{X} \mid \alpha) = \int_\theta p(\mathbf{X} \mid \theta)\, p(\theta \mid \alpha)\, \mathrm{d}\theta .$$
The above definition is phrased in the context of Bayesian statistics, in which case $p(\theta \mid \alpha)$ is called the prior density and $p(\mathbf{X} \mid \theta)$ is the likelihood. Recognizing that the marginal likelihood is the normalizing constant of the Bayesian posterior density $p(\theta \mid \mathbf{X}, \alpha)$, one also has the alternative expression[2]

$$p(\mathbf{X} \mid \alpha) = \frac{p(\mathbf{X} \mid \theta)\, p(\theta \mid \alpha)}{p(\theta \mid \mathbf{X}, \alpha)},$$
which is an identity in $\theta$. The marginal likelihood quantifies the agreement between data and prior in a geometric sense made precise in de Carvalho et al. (2019). In classical (frequentist) statistics, the concept of marginal likelihood occurs instead in the context of a joint parameter $\theta = (\psi, \lambda)$, where $\psi$ is the actual parameter of interest and $\lambda$ is a non-interesting nuisance parameter. If there exists a probability distribution for $\lambda$, it is often desirable to consider the likelihood function only in terms of $\psi$, by marginalizing out $\lambda$:

$$\mathcal{L}(\psi; \mathbf{X}) = \int_\lambda \mathcal{L}(\psi, \lambda; \mathbf{X})\, p(\lambda \mid \psi)\, \mathrm{d}\lambda .$$
Unfortunately, marginal likelihoods are generally difficult to compute. Exact solutions are known for a small class of distributions, particularly when the marginalized-out parameter is the conjugate prior of the distribution of the data. In other cases, some kind of numerical integration method is needed, either a general method such as Gaussian integration or a Monte Carlo method, or a method specialized to statistical problems such as the Laplace approximation, Gibbs/Metropolis sampling, or the EM algorithm.
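As a concrete illustration of the numerical route, the sketch below estimates a marginal likelihood by simple Monte Carlo averaging of the likelihood over draws from the prior, and compares it with the exact answer available in the conjugate case. The normal-normal model and all numbers are illustrative assumptions, not taken from the sources above.

```python
# Simple Monte Carlo estimate of a marginal likelihood, checked against the
# exact value in a conjugate normal-normal model (illustrative numbers):
#   y | mu ~ Normal(mu, sigma^2)   with sigma known
#   mu     ~ Normal(mu0, tau^2)
# Exact marginal for one observation: y ~ Normal(mu0, sigma^2 + tau^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y, sigma, mu0, tau = 1.3, 1.0, 0.0, 2.0

# Monte Carlo: average the likelihood over draws from the prior.
mu_draws = rng.normal(mu0, tau, size=100_000)
mc_estimate = norm.pdf(y, loc=mu_draws, scale=sigma).mean()

exact = norm.pdf(y, loc=mu0, scale=np.sqrt(sigma**2 + tau**2))
print(f"Monte Carlo: {mc_estimate:.5f}   exact: {exact:.5f}")
```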
It is also possible to apply the above considerations to a single random variable (data point) $x$, rather than a set of observations. In a Bayesian context, this is equivalent to the prior predictive distribution of a data point.
Applications
Bayesian model comparison
In Bayesian model comparison, the marginalized variables are parameters for a particular type of model, and the remaining variable is the identity of the model itself. In this case, the marginalized likelihood is the probability of the data given the model type, not assuming any particular model parameters. Writing $\theta$ for the model parameters, the marginal likelihood for the model $M$ is

$$p(\mathbf{X} \mid M) = \int p(\mathbf{X} \mid \theta, M)\, p(\theta \mid M)\, \mathrm{d}\theta .$$
It is in this context that the term model evidence is normally used. This quantity is important because the posterior odds ratio for a model $M_1$ against another model $M_2$ involves a ratio of marginal likelihoods, called the Bayes factor:

$$\frac{p(M_1 \mid \mathbf{X})}{p(M_2 \mid \mathbf{X})} = \frac{p(M_1)}{p(M_2)}\, \frac{p(\mathbf{X} \mid M_1)}{p(\mathbf{X} \mid M_2)},$$
which can be stated schematically as
- posterior odds = prior odds × Bayes factor
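A minimal numerical illustration of this schematic relationship, with made-up evidence values and equal prior odds, might look as follows.

```python
# Schematic "posterior odds = prior odds x Bayes factor" with toy numbers
# (the marginal likelihood values below are invented for illustration).
evidence_m1 = 2.4e-5   # p(X | M1), hypothetical
evidence_m2 = 0.6e-5   # p(X | M2), hypothetical
prior_odds = 1.0       # p(M1) / p(M2): models equally likely a priori

bayes_factor = evidence_m1 / evidence_m2
posterior_odds = prior_odds * bayes_factor
print(f"Bayes factor = {bayes_factor:.2f}, posterior odds = {posterior_odds:.2f}")
```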
References
1. Šmídl, Václav; Quinn, Anthony (2006). "Bayesian Theory". The Variational Bayes Method in Signal Processing. Springer. pp. 13–23. doi:10.1007/3-540-28820-1_2.
2. Chib, Siddhartha (1995). "Marginal likelihood from the Gibbs output". Journal of the American Statistical Association. 90 (432): 1313–1321. doi:10.1080/01621459.1995.10476635.
Further reading
- Bos, Charles S. (2002). "A comparison of marginal likelihood computation methods". In W. Härdle and B. Ronz (eds.), COMPSTAT 2002: Proceedings in Computational Statistics, pp. 111–117. (Available as a preprint, SSRN 332860.)
- de Carvalho, Miguel; Page, Garritt; Barney, Bradley (2019). "On the geometry of Bayesian inference". Bayesian Analysis. 14 (4): 1013–1036.
- Lambert, Ben (2018). "The devil is in the denominator". A Student's Guide to Bayesian Statistics. Sage. pp. 109–120. ISBN 978-1-4739-1636-4.
- MacKay, David J. C. Information Theory, Inference, and Learning Algorithms (free online textbook).
Marginal likelihood
Definition and Basics
Formal Definition
The marginal likelihood, denoted $p(y)$, is formally defined as the marginal probability of the observed data $y$, obtained by integrating the joint distribution of the data and the model parameters $\theta$ over the parameter space:

$$p(y) = \int p(y \mid \theta)\, p(\theta)\, \mathrm{d}\theta,$$

where $p(y \mid \theta)$ is the likelihood function and $p(\theta)$ is the prior distribution over the parameters $\theta$.[6] This integral represents the prior predictive distribution of the data, averaging the likelihood across all possible parameter values weighted by the prior.[6]

The process of marginalization in this context involves integrating out the parameters $\theta$, which are treated as nuisance parameters, to yield a probability for the data that is independent of any specific parameter values.[6] This eliminates dependence on $\theta$ by averaging over its uncertainty as specified by the prior, resulting in a quantity that summarizes the model's predictive content for the observed data without conditioning on particular estimates of $\theta$.[6]

In contrast to the joint probability $p(y, \theta)$, which depends on both the data and the parameters, the marginal likelihood marginalizes away $\theta$ and thus provides the unconditional probability of the data under the model.[6] This distinction underscores how marginalization transforms the parameter-dependent joint into a data-only marginal.[6] The term "marginal likelihood" is used interchangeably with "marginal probability" or "evidence" to refer to $p(y)$, particularly in contexts emphasizing its role as a model-level summary.[7]
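To make the prior predictive reading of $p(y)$ concrete, the following sketch draws parameters from the prior and then data from the likelihood, which is equivalent to sampling from the marginal. The normal-normal model and its hyperparameters are arbitrary choices for this illustration.

```python
# Ancestral sampling from the prior predictive: draw theta from the prior,
# then y from the likelihood.  The resulting draws of y follow the marginal
# p(y) = integral of p(y | theta) p(theta) dtheta.  Toy normal-normal model.
import numpy as np

rng = np.random.default_rng(1)
mu0, tau, sigma = 0.0, 2.0, 1.0

theta = rng.normal(mu0, tau, size=500_000)   # theta ~ p(theta)
y = rng.normal(theta, sigma)                 # y | theta ~ p(y | theta)

# For this model the marginal is Normal(mu0, sigma^2 + tau^2): check moments.
print(f"sample mean {y.mean():.3f} (exact {mu0:.3f}),",
      f"sample sd {y.std():.3f} (exact {np.sqrt(sigma**2 + tau**2):.3f})")
```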
Bayesian Interpretation

In Bayesian statistics, the marginal likelihood represents the predictive probability of the observed data under a given model, obtained by averaging the likelihood over all possible parameter values weighted by their prior distribution. This integration marginalizes out the parameters, providing a coherent measure of how well the model explains the data while incorporating prior beliefs about parameter uncertainty. As such, it serves as the Bayesian evidence for the model, distinct from conditional measures that fix parameters at specific values.[8]

The marginal likelihood functions as a key indicator of model plausibility: a higher value suggests the model offers a better overall fit to the data after accounting for the full range of parameter uncertainty encoded in the prior. Unlike point estimates, this averaging process naturally balances goodness of fit with the model's capacity to generalize, implicitly favoring parsimonious models that do not overfit by spreading probability mass too thinly across parameters. In model comparison, it enables direct assessment of competing hypotheses by quantifying the relative support each model receives from the data alone.[8]

In contrast to maximum likelihood estimation, which maximizes the likelihood at a single point estimate of the parameters and thus relies on plug-in predictions that ignore uncertainty, the marginal likelihood integrates over the entire parameter space. This integration implicitly penalizes model complexity through the prior's influence, as overly flexible models tend to dilute the evidence by assigning low density to the data under broad parameter ranges, whereas maximum likelihood can favor intricate models that fit noise without such restraint. Consequently, the marginal likelihood promotes more robust inference in finite samples by embedding Occam's razor directly into the evaluation.[2]

Historically, the marginal likelihood emerged as a central component in Bayesian model assessment through its role in computing Bayes factors, which compare the evidence for rival models and update prior odds to posterior odds in a coherent manner. This framework, formalized in seminal work on Bayes factors, provided a principled alternative to frequentist testing for evaluating scientific theories.[8]

Mathematical Framework
Expression in Parametric Models
In parametric statistical models, the marginal likelihood integrates out the model parameters to obtain the probability of the observed data under the model specification. Consider a parametric model $\{p(y \mid \theta, M) : \theta \in \Theta\}$ indexed by a parameter vector $\theta$, where $\Theta$ is the parameter space. The marginal likelihood is then expressed as

$$p(y \mid M) = \int_\Theta p(y \mid \theta, M)\, p(\theta \mid M)\, \mathrm{d}\theta,$$

where $p(y \mid \theta, M)$ denotes the likelihood function and $p(\theta \mid M)$ is the prior distribution over the parameters given the model.[8] This formulation arises naturally in Bayesian inference as the normalizing constant for the posterior distribution.[8]

When the parameter space is discrete, the continuous integral is replaced by a summation:

$$p(y \mid M) = \sum_{\theta \in \Theta} p(y \mid \theta, M)\, p(\theta \mid M).$$

This discrete form is applicable in scenarios such as finite mixture models with a discrete number of components or models with categorical parameters.[8] The dimensionality of the integral or sum corresponds directly to the dimension of $\theta$, reflecting the number of parameters being marginalized out; higher dimensionality increases computational demands but maintains the core structure of the expression.[8]

A concrete example is the simple univariate normal model, where the observed data $y_1, \ldots, y_n$ follow $y_i \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$, and the prior on the mean is $\mu \sim \mathcal{N}(\mu_0, \tau^2)$. The marginal likelihood involves the integral

$$p(y_1, \ldots, y_n) = \int \prod_{i=1}^{n} \mathcal{N}(y_i \mid \mu, \sigma^2)\, \mathcal{N}(\mu \mid \mu_0, \tau^2)\, \mathrm{d}\mu .$$

Completing the square in the exponent yields a closed-form normal distribution for the marginal: for a single observation, $y \sim \mathcal{N}(\mu_0, \sigma^2 + \tau^2)$, and more generally the vector $(y_1, \ldots, y_n)$ is multivariate normal with mean $\mu_0 \mathbf{1}$ and covariance $\sigma^2 I + \tau^2 \mathbf{1}\mathbf{1}^\top$. However, in the more general univariate normal model with unknown variance and a conjugate normal-inverse-gamma prior on $(\mu, \sigma^2)$, the full marginal likelihood over both parameters results in a closed-form Student's t-distribution for $y$, highlighting how prior choices enable analytical tractability.
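The closed form above can be checked numerically. The sketch below compares one-dimensional quadrature of the defining integral with the multivariate normal closed form; the data values and hyperparameters are chosen only for illustration.

```python
# Check of the closed-form normal marginal against direct 1-D quadrature,
# for the known-variance normal model with a normal prior on the mean.
import numpy as np
from scipy import integrate
from scipy.stats import norm, multivariate_normal

y = np.array([0.8, 1.4, -0.2])      # toy observations
sigma, mu0, tau = 1.0, 0.0, 2.0
n = len(y)

# Quadrature of  integral of prod_i N(y_i | mu, sigma^2) N(mu | mu0, tau^2) dmu.
def integrand(mu):
    return np.prod(norm.pdf(y, mu, sigma)) * norm.pdf(mu, mu0, tau)

quad_value, _ = integrate.quad(integrand, -50, 50)

# Closed form: y ~ N(mu0 * 1, sigma^2 I + tau^2 * 1 1^T).
cov = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
closed_form = multivariate_normal.pdf(y, mean=np.full(n, mu0), cov=cov)

print(f"quadrature: {quad_value:.6e}   closed form: {closed_form:.6e}")
```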
Relation to Posterior and Likelihood

In Bayesian inference, the marginal likelihood plays a central role in Bayes' theorem, which expresses the posterior distribution as

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)},$$

where $p(y \mid \theta)$ is the likelihood function, $p(\theta)$ is the prior distribution, and $p(y)$ is the marginal likelihood of the data $y$. This formulation updates prior beliefs about the parameters with observed data to yield the posterior $p(\theta \mid y)$.[6]

The marginal likelihood serves as the normalizing constant, or evidence, that ensures the posterior integrates to 1 over the parameter space, transforming the unnormalized product $p(y \mid \theta)\, p(\theta)$ into a proper probability density. Computationally, it is obtained by integrating the joint density of data and parameters: $p(y) = \int p(y \mid \theta)\, p(\theta)\, \mathrm{d}\theta$. This integration averages the likelihood across all possible parameter values weighted by the prior, providing a data-dependent measure of model plausibility independent of any specific $\theta$.[6]

Unlike the conditional likelihood $p(y \mid \theta)$, which conditions on fixed parameters and evaluates model fit for a particular $\theta$, the marginal likelihood marginalizes the full likelihood over the prior distribution, incorporating parameter uncertainty from the outset. This contrasts with the profiled likelihood in frequentist settings, where nuisance parameters are eliminated by maximization rather than integration, leading to a point-estimate-based adjustment without prior weighting.[6][9]

The marginal likelihood thus quantifies the total predictive uncertainty in the data under the model, encompassing both the variability in the likelihood due to unknown parameters and the prior's influence on that variability, whereas the likelihood alone fixes $\theta$ and ignores such averaging. This broader uncertainty measure supports coherent probabilistic reasoning in Bayesian frameworks.[6]
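A small grid-based sketch shows the marginal likelihood acting as the normalizing constant: summing likelihood times prior over a fine grid recovers $p(y)$, and dividing by it yields a posterior that integrates to one. The toy normal-normal model below is an assumption of this illustration.

```python
# The marginal likelihood as the normalizing constant of the posterior,
# demonstrated on a grid for a toy normal-normal model.
import numpy as np
from scipy.stats import norm

y, sigma, mu0, tau = 1.3, 1.0, 0.0, 2.0
theta = np.linspace(-10, 10, 4001)
dtheta = theta[1] - theta[0]

unnormalized = norm.pdf(y, theta, sigma) * norm.pdf(theta, mu0, tau)
evidence = unnormalized.sum() * dtheta            # approximates p(y)
posterior = unnormalized / evidence               # proper density on the grid

print(f"p(y) approx {evidence:.5f}")
print(f"posterior integrates to {posterior.sum() * dtheta:.5f}")
```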
Computation Methods
Analytical Approaches
Analytical approaches to computing the marginal likelihood rely on cases where the integral over the parameter space can be evaluated exactly in closed form, which occurs primarily under the use of conjugate priors.[10] Conjugate priors are distributions from the same family as the posterior, allowing the marginal likelihood to be derived as a normalizing constant without numerical integration.[11] This solvability holds when the likelihood and prior combine to yield a posterior in the same parametric family, simplifying the integration to known functions like the beta or gamma integrals.[10]

A classic example is the beta-binomial model, where the binomial likelihood for $k$ successes in $n$ coin flips is paired with a $\mathrm{Beta}(\alpha, \beta)$ prior on the success probability $\pi$. The marginal likelihood is then given by

$$p(k \mid n, \alpha, \beta) = \binom{n}{k}\, \frac{B(k + \alpha,\, n - k + \beta)}{B(\alpha, \beta)},$$

where $B(\cdot, \cdot)$ denotes the beta function, representing the integral over $\pi$.[12] This closed-form expression arises directly from the conjugacy, enabling exact Bayesian inference for binary data.[13]

In Gaussian models, conjugate priors such as the normal-inverse-Wishart facilitate analytical marginal likelihoods, often resulting in a multivariate Student's t-distribution for the data. For a multivariate normal likelihood with unknown mean and covariance under a normal-inverse-Wishart prior with hyperparameters $(\boldsymbol{\mu}_0, \kappa_0, \nu_0, \boldsymbol{\Psi}_0)$, the marginal likelihood of the data is

$$p(\mathbf{Y}) = \frac{1}{\pi^{ND/2}}\, \frac{\Gamma_D(\nu_N/2)}{\Gamma_D(\nu_0/2)}\, \frac{|\boldsymbol{\Psi}_0|^{\nu_0/2}}{|\boldsymbol{\Psi}_N|^{\nu_N/2}} \left(\frac{\kappa_0}{\kappa_N}\right)^{D/2},$$

where $\kappa_N = \kappa_0 + N$, $\nu_N = \nu_0 + N$, $\boldsymbol{\Psi}_N$ is the posterior scale matrix, $\Gamma_D$ is the multivariate gamma function, $D$ is the dimensionality, and $N$ is the number of observations.[10] This highlights the tractability in linear Gaussian settings.

However, such analytical solutions are rare, particularly in high-dimensional settings or with non-conjugate priors, where the parameter integral becomes intractable and necessitates numerical methods.[14] These limitations stem from the exponential growth in integration complexity as dimensionality increases, restricting exact computations to low-dimensional or specially structured models.[15]
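As a quick check of the beta-binomial formula, the following sketch evaluates the closed form through log-beta functions and compares it against direct numerical integration over the success probability. The counts and prior hyperparameters are arbitrary toy values.

```python
# Beta-binomial marginal likelihood: closed form versus numerical integration.
import numpy as np
from scipy import integrate
from scipy.special import comb, betaln
from scipy.stats import beta, binom

n, k = 20, 14          # 14 successes in 20 trials (toy data)
a, b = 2.0, 2.0        # Beta(a, b) prior

# Closed form: C(n, k) * B(k + a, n - k + b) / B(a, b), computed in log space.
closed_form = comb(n, k) * np.exp(betaln(k + a, n - k + b) - betaln(a, b))

# Numerical check: integral of Binomial(k | n, p) * Beta(p | a, b) dp over [0, 1].
numerical, _ = integrate.quad(lambda p: binom.pmf(k, n, p) * beta.pdf(p, a, b), 0, 1)

print(f"closed form: {closed_form:.6f}   numerical: {numerical:.6f}")
```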
Numerical and Approximation Techniques

When exact analytical computation of the marginal likelihood is infeasible, such as in non-conjugate models with high-dimensional parameter spaces, numerical and approximation techniques become essential for estimation. These methods leverage sampling or asymptotic expansions to approximate the integral $p(y) = \int p(y \mid \theta)\, p(\theta)\, \mathrm{d}\theta$, balancing computational feasibility with accuracy. Monte Carlo-based approaches, in particular, provide unbiased or consistent estimators but often require careful tuning to manage variance, while deterministic methods offer faster but potentially biased approximations suitable for large-scale applications.

Monte Carlo methods, including importance sampling, form a foundational class of estimators for the marginal likelihood. The importance sampling estimator approximates the integral as

$$\hat{p}(y) = \frac{1}{S} \sum_{s=1}^{S} \frac{p(y \mid \theta^{(s)})\, p(\theta^{(s)})}{q(\theta^{(s)})},$$

where $\theta^{(1)}, \ldots, \theta^{(S)}$ are samples drawn from a proposal distribution $q(\theta)$ chosen to approximate the posterior $p(\theta \mid y)$. The estimator is consistent under mild conditions on $q$, though high variance arises if $q$ poorly covers the posterior support, necessitating techniques like adaptive proposals or multiple importance sampling to improve efficiency in complex models.[16]

Markov chain Monte Carlo (MCMC) methods extend these ideas by generating dependent samples from the posterior to estimate the marginal likelihood without direct prior sampling. Bridge sampling, for instance, uses samples from both a proposal distribution $q(\theta)$ (often the prior) and the posterior to estimate the evidence via the identity

$$p(y) = \frac{\mathbb{E}_{q(\theta)}\!\left[ h(\theta)\, p(y \mid \theta)\, p(\theta) \right]}{\mathbb{E}_{p(\theta \mid y)}\!\left[ h(\theta)\, q(\theta) \right]},$$

where the bridge function $h(\theta)$ is chosen to minimize estimation variance; this approach is particularly robust for multimodal posteriors. The harmonic mean estimator, another posterior-based method, approximates $p(y)$ as the reciprocal of the average inverse likelihood over posterior samples,

$$\hat{p}(y) = \left[ \frac{1}{S} \sum_{s=1}^{S} \frac{1}{p(y \mid \theta^{(s)})} \right]^{-1}, \qquad \theta^{(s)} \sim p(\theta \mid y),$$

but suffers from infinite variance in heavy-tailed cases, prompting stabilized variants.[17][18]

Deterministic approximations, such as the Laplace method, provide closed-form estimates by exploiting local behavior around the posterior mode. This technique approximates the log-posterior as quadratic via its Hessian matrix at the mode $\hat{\theta}$, leading to

$$p(y) \approx p(y \mid \hat{\theta})\, p(\hat{\theta})\, (2\pi)^{d/2}\, \big| -\nabla^2 \log\{p(y \mid \hat{\theta})\, p(\hat{\theta})\} \big|^{-1/2},$$

where $d$ is the parameter dimension; the approximation improves asymptotically with the sample size but can bias results in small-data or highly nonlinear settings. It is computationally efficient, requiring only optimization and Hessian evaluation, and serves as a building block for higher-order corrections in moderate dimensions.[19]

Variational inference offers a scalable lower bound on the log-marginal likelihood through optimization, framing estimation as minimizing the Kullback-Leibler divergence between a tractable variational distribution $q(\theta)$ and the true posterior. The evidence lower bound (ELBO) states

$$\log p(y) \geq \mathbb{E}_{q(\theta)}\!\left[ \log p(y, \theta) - \log q(\theta) \right],$$

maximized by adjusting $q$ (often mean-field or structured forms) via stochastic gradients; this bound is tight when $q(\theta) = p(\theta \mid y)$ and enables fast inference on massive datasets, though it underestimates the true value and requires careful family selection to avoid loose bounds.[20]
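The sketch below illustrates two of these generic strategies, a Laplace approximation and importance sampling with the prior as proposal, on a beta-binomial model where the exact evidence is available. The specific numbers, the prior-as-proposal choice, and the finite-difference Hessian are assumptions of this illustration, not prescriptions from the sources cited above.

```python
# Two generic approximations of a marginal likelihood, compared with the
# exact beta-binomial evidence (toy numbers).
import numpy as np
from scipy import optimize
from scipy.special import comb, betaln
from scipy.stats import beta, binom

rng = np.random.default_rng(2)
n, k, a, b = 20, 14, 2.0, 2.0
exact = comb(n, k) * np.exp(betaln(k + a, n - k + b) - betaln(a, b))

def log_joint(p):
    # log of likelihood times prior: log p(k | p) + log p(p)
    return binom.logpmf(k, n, p) + beta.logpdf(p, a, b)

# Laplace approximation: quadratic expansion around the posterior mode.
res = optimize.minimize_scalar(lambda p: -log_joint(p),
                               bounds=(1e-6, 1 - 1e-6), method="bounded")
p_hat, h = res.x, 1e-5
hess = (log_joint(p_hat + h) - 2 * log_joint(p_hat) + log_joint(p_hat - h)) / h**2
laplace = np.exp(log_joint(p_hat)) * np.sqrt(2 * np.pi / -hess)

# Importance sampling with the prior as the (simple) proposal: the weights
# reduce to the likelihood evaluated at prior draws.
p_draws = rng.beta(a, b, size=100_000)
importance = np.mean(binom.pmf(k, n, p_draws))

print(f"exact {exact:.5f}   Laplace {laplace:.5f}   importance sampling {importance:.5f}")
```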
Recent advancements, including those post-2020, have enhanced these techniques for scalability in deep learning models, where parameter spaces exceed millions of dimensions. Annealed importance sampling (AIS) refines importance sampling by introducing intermediate distributions bridging prior and posterior, with differentiable variants enabling end-to-end optimization of annealing schedules for tighter estimates in generative models.[21][22] Sequential Monte Carlo (SMC) methods, propagating particles through annealed sequences with resampling, have been adapted for deep Bayesian networks, achieving unbiased marginal likelihoods via thermodynamic integration while integrating neural proposals for efficiency in high-dimensional tasks like variational autoencoders.[23] More recent methods, such as generalized stepping-stone sampling (2024), improve efficiency in specific domains like pulsar timing analysis.[24] These developments prioritize variance reduction and GPU acceleration, facilitating model comparison in neural architectures.
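A minimal annealed importance sampling sketch is given below for a one-dimensional conjugate model with a known exact evidence; the geometric annealing path, random-walk Metropolis kernel, and all numerical settings are illustrative choices rather than the specific differentiable or SMC variants cited above.

```python
# Minimal annealed importance sampling (AIS) sketch for a 1-D normal-normal
# model where the exact evidence is known.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
y, sigma, mu0, tau = 1.3, 1.0, 0.0, 2.0
exact = norm.pdf(y, mu0, np.sqrt(sigma**2 + tau**2))

def log_lik(theta):
    return norm.logpdf(y, theta, sigma)

def log_prior(theta):
    return norm.logpdf(theta, mu0, tau)

n_particles = 2000
betas = np.linspace(0.0, 1.0, 51)          # geometric annealing schedule
theta = rng.normal(mu0, tau, size=n_particles)   # start from the prior
log_w = np.zeros(n_particles)

for b_prev, b in zip(betas[:-1], betas[1:]):
    # Incremental importance weight between successive annealed targets.
    log_w += (b - b_prev) * log_lik(theta)
    # One random-walk Metropolis step targeting prior * likelihood^b.
    prop = theta + rng.normal(0.0, 0.5, size=n_particles)
    log_alpha = (log_prior(prop) + b * log_lik(prop)
                 - log_prior(theta) - b * log_lik(theta))
    accept = np.log(rng.uniform(size=n_particles)) < log_alpha
    theta = np.where(accept, prop, theta)

# The average importance weight is an unbiased estimate of the evidence.
ais_estimate = np.exp(log_w).mean()
print(f"AIS: {ais_estimate:.5f}   exact: {exact:.5f}")
```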
Applications in Statistics

Model Comparison and Selection
One key application of the marginal likelihood in Bayesian statistics is model comparison, where it serves as the basis for the Bayes factor, a measure of the relative evidence provided by the data for two competing models. The Bayes factor comparing model $M_1$ to model $M_2$ is defined as the ratio of their marginal likelihoods:

$$BF_{12} = \frac{p(y \mid M_1)}{p(y \mid M_2)},$$

where $y$ denotes the observed data.[25] This ratio quantifies how much more likely the data are under $M_1$ than under $M_2$, after integrating out model parameters via their priors, thereby providing a coherent framework for hypothesis testing and model selection without relying on arbitrary significance thresholds.[25]

For nested models, where $M_1$ is a special case of $M_2$ (e.g., obtained by imposing the restriction $\psi = \psi_0$ on a subset of parameters), the Savage-Dickey density ratio offers a convenient way to compute the Bayes factor using posterior and prior densities evaluated at the boundary values of the restricted parameters. Specifically, the Bayes factor is given by

$$BF_{12} = \frac{p(\psi = \psi_0 \mid y, M_2)}{p(\psi = \psi_0 \mid M_2)},$$

evaluated at the values $\psi_0$ of the restricted parameter that define the nesting boundary, under the encompassing model $M_2$, assuming the prior under $M_1$ matches the prior under $M_2$ for the common parameters. This approach simplifies computation by avoiding full marginal likelihood estimation for both models, though it requires careful prior specification to ensure validity.[26]

The marginal likelihood inherently implements Occam's razor in model selection by favoring parsimonious models that adequately fit the data, as more complex models must allocate prior probability mass across larger parameter spaces, effectively penalizing overparameterization unless the data strongly support the added complexity.[25] In practice, Bayes factors are interpreted using guidelines such as those proposed by Kass and Raftery, where values between 1 and 3 are barely worth mentioning, 3 to 20 indicate positive evidence, 20 to 150 strong evidence, and greater than 150 very strong evidence in favor of the numerator model (e.g., a Bayes factor above 20 would indicate strong support for $M_1$).[25] However, Bayes factors exhibit sensitivity to the choice of priors, which can lead to varying conclusions if priors are not chosen judiciously, and their computation becomes prohibitively expensive for high-dimensional or large-scale models, often necessitating approximations such as those obtained from MCMC methods.[25]
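The Savage-Dickey ratio can be illustrated on a point-null test of $\mu = 0$ nested in a normal model with known variance and a normal prior on $\mu$ under the encompassing model. The data and prior scale below are invented for the example, and the direct ratio of marginal likelihoods is computed as a cross-check.

```python
# Savage-Dickey density ratio for the point null mu = 0 in a normal model
# with known variance, checked against the direct ratio of evidences.
import numpy as np
from scipy.stats import norm, multivariate_normal

y = np.array([0.4, 1.1, 0.3, 0.9])   # toy observations
sigma, tau = 1.0, 1.0                # known likelihood sd; prior sd on mu (M2)
n = len(y)

# Conjugate posterior for mu under the encompassing model M2 (prior mean 0).
post_var = 1.0 / (n / sigma**2 + 1.0 / tau**2)
post_mean = post_var * y.sum() / sigma**2

# Savage-Dickey: BF_12 = posterior density at 0 / prior density at 0.
bf_savage_dickey = (norm.pdf(0.0, post_mean, np.sqrt(post_var))
                    / norm.pdf(0.0, 0.0, tau))

# Direct ratio of marginal likelihoods for comparison.
evidence_m1 = np.prod(norm.pdf(y, 0.0, sigma))                  # mu fixed at 0
cov_m2 = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
evidence_m2 = multivariate_normal.pdf(y, mean=np.zeros(n), cov=cov_m2)
bf_direct = evidence_m1 / evidence_m2

print(f"Savage-Dickey BF_12: {bf_savage_dickey:.4f}   direct: {bf_direct:.4f}")
```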
