Posterior predictive distribution
In Bayesian statistics, the posterior predictive distribution is the distribution of possible unobserved values conditional on the observed values.[1][2]
Given a set of $N$ i.i.d. observations $\mathbf{X} = \{x_1, \dots, x_N\}$, a new value $\tilde{x}$ will be drawn from a distribution that depends on a parameter $\theta \in \Theta$, where $\Theta$ is the parameter space.

It may seem tempting to plug in a single best estimate for $\theta$, but this ignores uncertainty about $\theta$, and because a source of uncertainty is ignored, the predictive distribution will be too narrow. Put another way, predictions of extreme values of $\tilde{x}$ will have a lower probability than if the uncertainty in the parameters as given by their posterior distribution is accounted for.

A posterior predictive distribution accounts for uncertainty about $\theta$. The posterior distribution of possible $\theta$ values depends on $\mathbf{X}$:

$p(\theta \mid \mathbf{X})$

And the posterior predictive distribution of $\tilde{x}$ given $\mathbf{X}$ is calculated by marginalizing the distribution of $\tilde{x}$ given $\theta$ over the posterior distribution of $\theta$ given $\mathbf{X}$:

$p(\tilde{x} \mid \mathbf{X}) = \int_{\Theta} p(\tilde{x} \mid \theta) \, p(\theta \mid \mathbf{X}) \, d\theta$

Because it accounts for uncertainty about $\theta$, the posterior predictive distribution will in general be wider than a predictive distribution which plugs in a single best estimate for $\theta$.
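To make the marginalization concrete, here is a minimal Python sketch (assuming NumPy and SciPy are available; the model, prior settings, and simulated data are invented for illustration) comparing a plug-in predictive with the posterior predictive for a normal model with known variance and a conjugate normal prior on the mean.

```python
# Sketch: posterior predictive vs. plug-in predictive for a normal model
# with known observation variance and a conjugate normal prior on the mean.
# All numbers are hypothetical and chosen only for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

sigma = 1.0                      # known observation standard deviation
mu0, tau0 = 0.0, 2.0             # prior mean and prior standard deviation
x = rng.normal(1.5, sigma, 20)   # simulated "observed" data
n = x.size

# Conjugate update for the mean: posterior is N(mu_n, tau_n^2)
tau_n2 = 1.0 / (1.0 / tau0**2 + n / sigma**2)
mu_n = tau_n2 * (mu0 / tau0**2 + x.sum() / sigma**2)

# Plug-in predictive: fix theta at a single best estimate (the posterior mean)
plug_in = stats.norm(mu_n, sigma)

# Posterior predictive by Monte Carlo: theta ~ posterior, then x_new | theta
theta_draws = rng.normal(mu_n, np.sqrt(tau_n2), 100_000)
x_new = rng.normal(theta_draws, sigma)

# Exact posterior predictive for comparison: N(mu_n, sigma^2 + tau_n^2)
exact = stats.norm(mu_n, np.sqrt(sigma**2 + tau_n2))

print("plug-in sd:                      ", plug_in.std())
print("posterior predictive sd (MC):    ", x_new.std())
print("posterior predictive sd (exact): ", exact.std())
# The posterior predictive is wider than the plug-in predictive because it
# propagates the remaining uncertainty about theta.
```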
Prior vs. posterior predictive distribution
The prior predictive distribution, in a Bayesian context, is the distribution of a data point marginalized over its prior distribution $G(\theta)$. That is, if $\tilde{x} \sim F(\tilde{x} \mid \theta)$ and $\theta \sim G(\theta)$, then the prior predictive distribution is the corresponding distribution $H(\tilde{x})$, where

$p_H(\tilde{x}) = \int_{\Theta} p_F(\tilde{x} \mid \theta) \, p_G(\theta) \, d\theta$

This is similar to the posterior predictive distribution except that the marginalization (or equivalently, expectation) is taken with respect to the prior distribution instead of the posterior distribution.

Furthermore, if the prior distribution $G(\theta \mid \alpha)$ is a conjugate prior, then the posterior predictive distribution will belong to the same family of distributions as the prior predictive distribution. This is easy to see. If the prior distribution is conjugate, then

$p(\theta \mid \mathbf{X}, \alpha) = p_G(\theta \mid \alpha')$

i.e. the posterior distribution also belongs to $G(\theta \mid \alpha)$ but simply with a different parameter $\alpha'$ instead of the original parameter $\alpha$. Then,

$p(\tilde{x} \mid \mathbf{X}, \alpha) = \int_{\Theta} p_F(\tilde{x} \mid \theta) \, p(\theta \mid \mathbf{X}, \alpha) \, d\theta = \int_{\Theta} p_F(\tilde{x} \mid \theta) \, p_G(\theta \mid \alpha') \, d\theta = p_H(\tilde{x} \mid \alpha')$

Hence, the posterior predictive distribution follows the same distribution H as the prior predictive distribution, but with the posterior values of the hyperparameters substituted for the prior ones.
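As a hedged illustration of this family-preservation property, the sketch below (hypothetical hyperparameters and data; SciPy's negative binomial parameterization assumed) uses a Poisson likelihood with a Gamma prior: both the prior predictive and the posterior predictive are negative binomial, differing only in their hyperparameters.

```python
# Sketch: with a conjugate prior, prior and posterior predictive share the
# same family H, differing only in hyperparameters. Here: Poisson likelihood,
# Gamma(a, b) prior with rate b, predictive family H = negative binomial.
# The numbers are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

a, b = 3.0, 2.0                            # prior hyperparameters (shape, rate)
x = rng.poisson(1.2, size=15)              # simulated observed counts
a_post, b_post = a + x.sum(), b + x.size   # conjugate posterior Gamma(a', b')

# H(. | a, b) is NegBin(n=a, p=b/(b+1)) in SciPy's convention.
prior_pred = stats.nbinom(a, b / (b + 1.0))
post_pred = stats.nbinom(a_post, b_post / (b_post + 1.0))

# Monte Carlo check: lambda ~ posterior Gamma, x_new ~ Poisson(lambda)
lam = rng.gamma(shape=a_post, scale=1.0 / b_post, size=200_000)
x_new = rng.poisson(lam)

k = np.arange(6)
print("posterior predictive pmf (closed form):", np.round(post_pred.pmf(k), 4))
print("posterior predictive pmf (Monte Carlo):",
      np.round([np.mean(x_new == i) for i in k], 4))
print("same family as prior predictive, updated hyperparameters:",
      (a, b), "->", (a_post, b_post))
```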
The prior predictive distribution is in the form of a compound distribution, and in fact is often used to define a compound distribution, because of the lack of any complicating factors such as the dependence on the data and the issue of conjugacy. For example, the Student's t-distribution can be defined as the prior predictive distribution of a normal distribution with known mean $\mu$ but unknown variance $\sigma_x^2$, with a conjugate prior scaled-inverse-chi-squared distribution placed on $\sigma_x^2$, with hyperparameters $\nu$ and $\sigma^2$. The resulting compound distribution is indeed a non-standardized Student's t-distribution, and follows one of the two most common parameterizations of this distribution. Then, the corresponding posterior predictive distribution would again be Student's t, with the updated hyperparameters that appear in the posterior distribution also directly appearing in the posterior predictive distribution.
In some cases the appropriate compound distribution is defined using a different parameterization than the one that would be most natural for the predictive distributions in the current problem at hand. Often this results because the prior distribution used to define the compound distribution is different from the one used in the current problem. For example, as indicated above, the Student's t-distribution was defined in terms of a scaled-inverse-chi-squared distribution placed on the variance. However, it is more common to use an inverse gamma distribution as the conjugate prior in this situation. The two are in fact equivalent except for parameterization; hence, the Student's t-distribution can still be used for either predictive distribution, but the hyperparameters must be reparameterized before being plugged in.
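The following sketch (hypothetical values of $\mu$, $\nu$, and the scale hyperparameter, written as s2 in place of $\sigma^2$) checks this numerically: sampling the variance from a scaled-inverse-chi-squared prior, implemented via its inverse-gamma reparameterization, and then sampling the observation reproduces the non-standardized Student's t prior predictive.

```python
# Sketch: the compound of a Normal(mu, sigma_x^2) likelihood with a
# scaled-inverse-chi-squared(nu, s2) prior on sigma_x^2 is a
# non-standardized Student's t with df=nu, loc=mu, scale=sqrt(s2).
# Scaled-Inv-Chi2(nu, s2) is InvGamma(nu/2, nu*s2/2) after reparameterization.
# Hyperparameter values below are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu, nu, s2 = 1.0, 5.0, 2.0

# Sample the compound: sigma_x^2 ~ Scaled-Inv-Chi2(nu, s2), then x ~ Normal
sigma2 = stats.invgamma(nu / 2.0, scale=nu * s2 / 2.0).rvs(300_000, random_state=rng)
x = rng.normal(mu, np.sqrt(sigma2))

# Closed-form prior predictive: non-standardized Student's t
t_pred = stats.t(df=nu, loc=mu, scale=np.sqrt(s2))

# Compare a few quantiles of the simulated compound with the t distribution
qs = [0.05, 0.25, 0.5, 0.75, 0.95]
print("simulated:", np.round(np.quantile(x, qs), 3))
print("Student t:", np.round(t_pred.ppf(qs), 3))
```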
In exponential families
Most, but not all, common families of distributions are exponential families. Exponential families have a large number of useful properties. One of these is that all members have conjugate prior distributions, whereas very few other distributions have conjugate priors.
Prior predictive distribution in exponential families
Another useful property is that the probability density function of the compound distribution corresponding to the prior predictive distribution of an exponential family distribution marginalized over its conjugate prior distribution can be determined analytically. Assume that $F(x \mid \theta)$ is a member of the exponential family with parameter $\theta$ that is parametrized according to the natural parameter $\eta = \eta(\theta)$, and is distributed as

$p_F(x \mid \eta) = h(x) \, g(\eta) \, e^{\eta^{\mathsf{T}} T(x)}$

while $p_G(\eta \mid \chi, \nu)$ is the appropriate conjugate prior, distributed as

$p_G(\eta \mid \chi, \nu) = f(\chi, \nu) \, g(\eta)^{\nu} \, e^{\eta^{\mathsf{T}} \chi}$

Then the prior predictive distribution $H$ (the result of compounding $F$ with $G$) is

$p_H(x \mid \chi, \nu) = \int p_F(x \mid \eta) \, p_G(\eta \mid \chi, \nu) \, d\eta = \int h(x) \, g(\eta) \, e^{\eta^{\mathsf{T}} T(x)} \, f(\chi, \nu) \, g(\eta)^{\nu} \, e^{\eta^{\mathsf{T}} \chi} \, d\eta = h(x) \, f(\chi, \nu) \int g(\eta)^{\nu + 1} \, e^{\eta^{\mathsf{T}} (\chi + T(x))} \, d\eta = h(x) \, \frac{f(\chi, \nu)}{f(\chi + T(x), \nu + 1)}$

The last line follows from the previous one by recognizing that the function inside the integral is the density function of a random variable distributed as $G(\eta \mid \chi + T(x), \nu + 1)$, excluding the normalizing function $f(\chi + T(x), \nu + 1)$. Hence the result of the integration will be the reciprocal of the normalizing function.

The above result is independent of the choice of parametrization of $\theta$, as none of $\theta$, $\eta$ and $g(\eta)$ appears in it. ($g(\eta)$ is a function of the parameter and hence will assume different forms depending on the choice of parametrization.) For standard choices of $F$ and $G$, it is often easier to work directly with the usual parameters rather than rewrite in terms of the natural parameters.
The reason the integral is tractable is that it involves computing the normalization constant of a density defined by the product of a prior distribution and a likelihood. When the two are conjugate, the product is a posterior distribution, and by assumption, the normalization constant of this distribution is known. As shown above, the density function of the compound distribution follows a particular form, consisting of the product of the function $h(x)$ that forms part of the density function for $F$, with the quotient of two forms of the normalization "constant" for $G$, one derived from a prior distribution and the other from a posterior distribution. The beta-binomial distribution is a good example of how this process works.
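A minimal sketch of that beta-binomial case (hypothetical hyperparameters, NumPy and SciPy assumed): the prior predictive is written as $h(x)$ times a quotient of the Beta prior's and posterior's normalizing constants, and checked against SciPy's beta-binomial.

```python
# Sketch: the beta-binomial prior predictive as h(x) times a quotient of
# normalizing "constants", one from the prior and one from the posterior.
# For a Binomial(n, p) likelihood with a Beta(a, b) prior, the prior
# normalizer is 1/B(a, b), the posterior normalizer is 1/B(a + x, b + n - x),
# and h(x) = C(n, x), so p(x) = C(n, x) * B(a + x, b + n - x) / B(a, b).
# Hyperparameters below are hypothetical.
import numpy as np
from scipy import stats
from scipy.special import betaln, comb

a, b, n = 2.0, 3.0, 10

def beta_binomial_pmf(x):
    # C(n, x) * exp(log B(a + x, b + n - x) - log B(a, b))
    return comb(n, x) * np.exp(betaln(a + x, b + n - x) - betaln(a, b))

x = np.arange(n + 1)
print("via normalizer ratio:", np.round(beta_binomial_pmf(x), 4))
print("scipy betabinom     :", np.round(stats.betabinom(n, a, b).pmf(x), 4))
print("sums to one:", np.isclose(beta_binomial_pmf(x).sum(), 1.0))
```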
Despite the analytical tractability of such distributions, they are in themselves usually not members of the exponential family. For example, the three-parameter Student's t distribution, beta-binomial distribution and Dirichlet-multinomial distribution are all predictive distributions of exponential-family distributions (the normal distribution, binomial distribution and multinomial distributions, respectively), but none are members of the exponential family. This can be seen above due to the presence of functional dependence on $\chi + T(x)$. In an exponential-family distribution, it must be possible to separate the entire density function into multiplicative factors of three types: (1) factors containing only variables, (2) factors containing only parameters, and (3) factors whose logarithm factorizes between variables and parameters. The presence of $\chi + T(x)$ makes this impossible unless the "normalizing" function $f(\cdot)$ either ignores the corresponding argument entirely or uses it only in the exponent of an expression.
Posterior predictive distribution in exponential families
When a conjugate prior is being used, the posterior predictive distribution belongs to the same family as the prior predictive distribution, and is determined simply by plugging the updated hyperparameters for the posterior distribution of the parameter(s) into the formula for the prior predictive distribution. Using the general form of the posterior update equations for exponential-family distributions (see the appropriate section in the exponential family article), we can write out an explicit formula for the posterior predictive distribution:

$p(\tilde{x} \mid \mathbf{X}, \chi, \nu) = p_H\left(\tilde{x} \mid \chi + T(\mathbf{X}), \nu + N\right) = h(\tilde{x}) \, \frac{f\left(\chi + T(\mathbf{X}), \nu + N\right)}{f\left(\chi + T(\mathbf{X}) + T(\tilde{x}), \nu + N + 1\right)}$

where

$T(\mathbf{X}) = \sum_{i=1}^{N} T(x_i)$

This shows that the posterior predictive distribution of a series of observations, in the case where the observations follow an exponential family with the appropriate conjugate prior, has the same probability density as the compound distribution, with parameters as specified above. The observations themselves enter only in the form $T(\mathbf{X}) = \sum_{i=1}^{N} T(x_i)$.
This is termed the sufficient statistic of the observations, because it tells us everything we need to know about the observations in order to compute a posterior or posterior predictive distribution based on them (or, for that matter, anything else based on the likelihood of the observations, such as the marginal likelihood).
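A small illustration (Bernoulli observations with a Beta prior; the data and hyperparameters are hypothetical): two datasets with the same sufficient statistic yield exactly the same posterior predictive.

```python
# Sketch: the posterior predictive depends on the data only through the
# sufficient statistic T(X). For Bernoulli observations with a Beta(a, b)
# prior, T(X) = sum(X), so any two datasets with the same length and sum
# give the same posterior predictive probability. Hypothetical values below.
import numpy as np

a, b = 1.0, 1.0

def post_pred_prob_one(data):
    """P(x_new = 1 | data) for Bernoulli data under a Beta(a, b) prior."""
    s, n = np.sum(data), len(data)          # sufficient statistic and count
    return (a + s) / (a + b + n)            # mean of the Beta posterior

X1 = [1, 1, 0, 0, 1, 0, 1, 0]               # sum = 4
X2 = [0, 0, 0, 0, 1, 1, 1, 1]               # same length, same sum
print(post_pred_prob_one(X1), post_pred_prob_one(X2))   # identical
```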
Joint predictive distribution, marginal likelihood
It is also possible to consider the result of compounding a joint distribution over a fixed number of independent identically distributed samples with a prior distribution over a shared parameter. In a Bayesian setting, this comes up in various contexts: computing the prior or posterior predictive distribution of multiple new observations, and computing the marginal likelihood of observed data (the denominator in Bayes' law). When the distribution of the samples is from the exponential family and the prior distribution is conjugate, the resulting compound distribution will be tractable and follow a similar form to the expression above. It is easy to show, in fact, that the joint compound distribution of a set $\mathbf{X} = \{x_1, \dots, x_N\}$ of $N$ observations is

$p_H(\mathbf{X} \mid \chi, \nu) = \left( \prod_{i=1}^{N} h(x_i) \right) \frac{f(\chi, \nu)}{f\left(\chi + T(\mathbf{X}), \nu + N\right)}$
This result and the above result for a single compound distribution extend trivially to the case of a distribution over a vector-valued observation, such as a multivariate Gaussian distribution.
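As a sketch of the joint compound distribution for Bernoulli data with a Beta prior (the data and hyperparameters are hypothetical), the closed-form expression, a ratio of Beta-function normalizers, agrees with a Monte Carlo average of the joint likelihood over the prior.

```python
# Sketch: the joint compound distribution of N i.i.d. Bernoulli observations
# under a Beta(a, b) prior is p(X) = B(a + s, b + N - s) / B(a, b), s = sum(X),
# checked against a Monte Carlo average over the prior. Values are hypothetical.
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(3)
a, b = 2.0, 2.0
X = np.array([1, 0, 1, 1, 0, 1, 1, 0, 0, 1])
s, N = X.sum(), X.size

closed_form = np.exp(betaln(a + s, b + N - s) - betaln(a, b))

theta = rng.beta(a, b, 200_000)
mc = np.mean(theta**s * (1.0 - theta)**(N - s))

print("closed form:", closed_form)
print("Monte Carlo:", mc)
```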
Relation to Gibbs sampling
Collapsing out a node in a collapsed Gibbs sampler is equivalent to compounding. As a result, when a set of independent identically distributed (i.i.d.) nodes all depend on the same prior node, and that node is collapsed out, the resulting conditional probability of one node given the others as well as the parents of the collapsed-out node (but not conditioning on any other nodes, e.g. any child nodes) is the same as the posterior predictive distribution of all the remaining i.i.d. nodes (or more correctly, formerly i.i.d. nodes, since collapsing introduces dependencies among the nodes). That is, it is generally possible to implement collapsing out of a node simply by attaching all parents of the node directly to all children, and replacing the former conditional probability distribution associated with each child with the corresponding posterior predictive distribution for the child conditioned on its parents and the other formerly i.i.d. nodes that were also children of the removed node. For an example, for more specific discussion and for some cautions about certain tricky issues, see the Dirichlet-multinomial distribution article.
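A toy sketch of this equivalence for i.i.d. Bernoulli nodes whose shared Beta prior has been collapsed out (all values hypothetical; in a real model the resampled nodes would be latent): the conditional used in a collapsed Gibbs sweep is exactly the posterior predictive based on the other nodes.

```python
# Sketch: i.i.d. Bernoulli nodes x_1..x_N share a Beta(a, b) prior on theta.
# After collapsing theta out, the conditional of one node given the others
# equals the posterior predictive based on the remaining (formerly i.i.d.)
# nodes: P(x_i = 1 | x_-i) = (a + sum(x_-i)) / (a + b + N - 1).
# Values are hypothetical and for illustration only.
import numpy as np

rng = np.random.default_rng(4)
a, b = 1.0, 1.0
x = np.array([1, 0, 1, 1, 0, 1, 0, 1])   # current state of the nodes
N = x.size

def collapsed_conditional(i):
    s_rest = x.sum() - x[i]               # sufficient statistic of the other nodes
    return (a + s_rest) / (a + b + (N - 1))

print([round(collapsed_conditional(i), 3) for i in range(N)])

# One collapsed Gibbs sweep: resample each node from its collapsed conditional
for i in range(N):
    x[i] = rng.random() < collapsed_conditional(i)
print(x)
```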
References
- ^ "Posterior Predictive Distribution". SAS. Retrieved 19 July 2014.
- ^ Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Dunson, David B.; Vehtari, Aki; Rubin, Donald B. (2013). Bayesian Data Analysis (Third ed.). Chapman and Hall/CRC. p. 7. ISBN 978-1-4398-4095-5.
Further reading
- Ntzoufras, Ioannis (2009). "The Predictive Distribution and Model Checking". Bayesian Modeling Using WinBUGS. Wiley. ISBN 978-0-470-14114-4.
Fundamentals
Definition and Bayesian Context
In Bayesian inference, the prior distribution $p(\theta)$ encodes initial beliefs about the unknown parameters $\theta$ before observing any data. The likelihood function $p(y \mid \theta)$ then quantifies the probability of the observed data given those parameters. Updating these with the data yields the posterior distribution $p(\theta \mid y)$, which represents the refined beliefs about $\theta$ after incorporating the evidence from $y$.[1] The posterior predictive distribution extends this framework to forecast unobserved future data $\tilde{y}$ conditional on the observed data $y$. It is formally defined as

$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) \, p(\theta \mid y) \, d\theta$

where the integral averages the conditional distribution of new data over the entire posterior uncertainty in $\theta$. This approach inherently accounts for parameter variability, providing a full probabilistic prediction rather than a point estimate.[1][2]

The motivation for the posterior predictive distribution lies in its ability to generate predictions that reflect both model structure and epistemic uncertainty, enabling robust assessments of future observations without assuming fixed parameters. By marginalizing over $\theta$, it avoids overconfidence in any single parameter value and supports decision-making under uncertainty in fields like forecasting and simulation.[1]

The concept traces its roots to Pierre-Simon Laplace's 18th-century work on inverse probability, where he first explored updating probabilities based on data to infer causes and predict outcomes. It was formalized within modern Bayesian statistics during the 20th century, notably in foundational treatments that emphasized predictive inference as a core application of the posterior.[3][2]
Prior Predictive Distribution
The prior predictive distribution, denoted $p(\tilde{y})$, is the marginal distribution of a new observable data point obtained by integrating the likelihood over the prior distribution of the parameters $\theta$:

$p(\tilde{y}) = \int p(\tilde{y} \mid \theta) \, p(\theta) \, d\theta$

This formulation arises from marginalizing the joint distribution $p(\tilde{y}, \theta)$ with respect to $\theta$, yielding the unconditional distribution of the data under the prior alone.[1]

This distribution encodes the researcher's beliefs about possible data outcomes before any observations are made, serving as a tool for eliciting and validating prior specifications in Bayesian models. It allows practitioners to simulate hypothetical datasets from the prior to assess whether the implied data-generating process aligns with domain knowledge or expected variability.[1]

A simple example occurs in the conjugate case of a normal likelihood with a normal prior. Suppose the prior is $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)$ and the likelihood for a new observation is $\tilde{y} \mid \theta \sim \mathcal{N}(\theta, \sigma^2)$ with known variance $\sigma^2$. The prior predictive distribution then simplifies to a closed-form normal, $\tilde{y} \sim \mathcal{N}(\mu_0, \sigma_0^2 + \sigma^2)$, reflecting the combined uncertainty from the prior and sampling variance. This result highlights how conjugacy facilitates analytical tractability for prior predictions.[1]
Posterior Predictive Distribution
The posterior predictive distribution describes the conditional probability of future or unobserved data $\tilde{y}$ given observed data $y$, obtained by marginalizing over the posterior distribution of the model parameters $\theta$. It is formally expressed as

$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) \, p(\theta \mid y) \, d\theta$

where the posterior is $p(\theta \mid y) = p(y \mid \theta) \, p(\theta) / p(y)$, with prior $p(\theta)$ and marginal likelihood $p(y)$.[1] This formulation arises naturally in Bayesian inference as a way to generate predictions that fully incorporate updated beliefs about $\theta$ after observing $y$.[1]

A key property of the posterior predictive distribution is its integration of parameter uncertainty, which accounts for both the variability in the data-generating process and the remaining doubt about $\theta$ after conditioning on $y$; consequently, it yields interval predictions that are generally wider than those derived from plug-in likelihood estimates using point estimates of $\theta$.[1] Under standard regularity conditions and a correctly specified model, as the amount of observed data grows, the posterior predictive distribution asymptotically approximates the true sampling distribution of future observations, providing a bridge to frequentist interpretations in large samples.[1]

Computing the posterior predictive distribution involves evaluating the integral, which admits closed-form solutions in conjugate models but becomes intractable in non-conjugate or high-dimensional settings, often requiring numerical approximations such as Markov chain Monte Carlo (MCMC) methods to sample from the posterior and then simulate from the conditional likelihood.[1]

As a basic illustrative example, consider a binomial likelihood for the number of successes $y$ in $n$ trials with unknown success probability $\theta$, paired with a conjugate Beta($\alpha, \beta$) prior; the resulting posterior is Beta($\alpha + y, \beta + n - y$), and the posterior predictive distribution for the number of successes in $m$ future trials follows a beta-binomial distribution with parameters $m$, $\alpha + y$, and $\beta + n - y$.[4] This example highlights how the posterior predictive smooths the discrete binomial probabilities, reflecting overdispersion due to uncertainty in $\theta$.[4]
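A brief numerical sketch of this beta-binomial example (assuming SciPy's betabinom; the counts, trial numbers, and the symbol m for the number of future trials are illustrative choices).

```python
# Sketch: binomial likelihood with a Beta(alpha, beta) prior. After observing
# y successes in n trials, the posterior is Beta(alpha + y, beta + n - y) and
# the posterior predictive for successes in m future trials is
# BetaBinomial(m, alpha + y, beta + n - y). All numbers are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
alpha, beta, n, y, m = 2.0, 2.0, 20, 14, 10

a_post, b_post = alpha + y, beta + n - y
post_pred = stats.betabinom(m, a_post, b_post)

# Monte Carlo check: theta ~ posterior, y_new ~ Binomial(m, theta)
theta = rng.beta(a_post, b_post, 200_000)
y_new = rng.binomial(m, theta)

k = np.arange(m + 1)
print("closed form:", np.round(post_pred.pmf(k), 3))
print("Monte Carlo:", np.round(np.bincount(y_new, minlength=m + 1) / y_new.size, 3))
# Note the extra spread relative to Binomial(m, y/n): overdispersion from
# uncertainty in theta.
```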
Comparisons and Implications
Differences from Prior Predictive Distribution
The prior predictive distribution, denoted $p(\tilde{y})$, encapsulates the uncertainty in future observations arising from both the prior beliefs about parameters and the inherent stochasticity in the data-generating process, without any conditioning on observed data $y$.[1] In contrast, the posterior predictive distribution, $p(\tilde{y} \mid y)$, conditions predictions on the observed data, thereby incorporating empirical evidence to update parameter uncertainty via the posterior $p(\theta \mid y)$.[1] This fundamental distinction means the prior predictive reflects a broader range of plausible outcomes driven solely by subjective prior knowledge, while the posterior predictive "shrinks" toward the observed data, leveraging learning from $y$ to refine expectations.[1]

The posterior predictive can be understood as a conditional form of the prior predictive, expressed through the relationship $p(\tilde{y} \mid y) = p(y, \tilde{y}) / p(y)$, where the joint distribution $p(y, \tilde{y})$ links the two and the marginal likelihood $p(y)$ serves as a normalizing constant.[1] This updating process transforms the unconditional prior predictive into a data-informed version, effectively averaging the likelihood over the posterior rather than the prior.[1]

A key implication for uncertainty quantification is that the posterior predictive distribution typically exhibits lower predictive variance compared to its prior counterpart, due to the concentration of the posterior around values consistent with the data, which reduces the spread induced by parameter uncertainty.[1] For instance, in a normal linear model with unknown mean and known variance, the prior predictive variance includes the full prior variance of the mean plus the data variance, whereas the posterior predictive variance substitutes the smaller posterior variance of the mean, leading to tighter intervals for future predictions.[1]

To illustrate these differences visually, consider a simple univariate normal model where the parameter follows a broad normal prior (e.g., mean 0, variance 10). The prior predictive density for a new observation is a normal distribution centered at 0 with high variance (prior variance plus observation variance), appearing as a wide, flat curve. After observing data (e.g., a sample mean of 2), the posterior predictive density shifts its center toward 2 and narrows significantly, reflecting reduced uncertainty and a more peaked distribution that aligns closely with the data. Such plots highlight how observation updates the predictive distribution from diffuse prior expectations to more precise, evidence-based forecasts.[1] This contrast underscores the posterior predictive's role in model checking, where replicated data from it are compared to observed $y$ to assess fit.[1]
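The sketch below (hypothetical prior settings and simulated data) computes the prior and posterior predictive means and standard deviations for exactly this kind of univariate normal model, showing the re-centering and narrowing described above.

```python
# Sketch: prior vs. posterior predictive spread in a normal model with known
# observation variance sigma^2 and a Normal(mu0, tau0^2) prior on the mean.
# Prior predictive:     Normal(mu0,  tau0^2 + sigma^2)
# Posterior predictive: Normal(mu_n, tau_n^2 + sigma^2), with tau_n^2 < tau0^2.
# Prior values and data below are hypothetical.
import numpy as np

rng = np.random.default_rng(6)
sigma, mu0, tau0 = 1.0, 0.0, np.sqrt(10.0)
x = rng.normal(2.0, sigma, 25)
n = x.size

tau_n2 = 1.0 / (1.0 / tau0**2 + n / sigma**2)
mu_n = tau_n2 * (mu0 / tau0**2 + x.sum() / sigma**2)

print("prior predictive    : mean %.2f, sd %.2f" % (mu0, np.sqrt(tau0**2 + sigma**2)))
print("posterior predictive: mean %.2f, sd %.2f" % (mu_n, np.sqrt(tau_n2 + sigma**2)))
# The posterior predictive re-centres near the sample mean and is much narrower.
```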
Role in Model Checking and Selection
Posterior predictive checks (PPCs) provide a Bayesian approach to model evaluation by simulating replicated data sets $y^{\text{rep}}$ from the posterior predictive distribution and comparing them to the observed data using a test statistic $T(y)$, such as means, variances, or tail probabilities, to assess overall model fit. This method integrates over the posterior distribution of parameters $\theta$,

$p(y^{\text{rep}} \mid y) = \int p(y^{\text{rep}} \mid \theta) \, p(\theta \mid y) \, d\theta$

allowing researchers to identify systematic discrepancies that indicate model misspecification, such as inadequate capture of data variability or dependence structures.[5] For instance, the posterior predictive p-value, defined as $\Pr\left(T(y^{\text{rep}}) \geq T(y) \mid y\right)$, quantifies the probability of observing a discrepancy at least as extreme as the actual data under the fitted model; values near 0 or 1 suggest poor fit.[6]

In Bayesian model selection, the posterior predictive distribution contributes to criteria that evaluate predictive performance across competing models, favoring those with higher expected log predictive densities.[7] Methods like the widely applicable information criterion (WAIC) and leave-one-out cross-validation (LOO) approximate the out-of-sample predictive accuracy (the expected log pointwise predictive density), where WAIC decomposes into the log pointwise predictive density minus a variance penalty, and LOO uses importance sampling on posterior draws to estimate leave-one-out predictive densities without refitting the model.[8] These metrics enable ranking of models by their ability to predict new data, with higher values indicating better generalization; for example, in comparing hierarchical models, WAIC or LOO can select the structure that balances fit and complexity more effectively than in-sample likelihoods.[7]

A practical example arises in linear regression, where PPCs simulate from the posterior predictive under a normal error model and examine whether the resulting residuals or 95% predictive intervals align with observed patterns, such as uniform coverage or no heteroscedasticity in the discrepancies.[6] If the observed residuals fall outside the distribution of simulated ones, it signals issues like omitted variables or incorrect error assumptions.[9]

Despite their utility, PPCs exhibit limitations, including sensitivity to prior specifications, where informative priors can distort the posterior predictive distribution and lead to misleading fit assessments, particularly in low-data regimes.[5] Additionally, generating sufficient replicates for reliable comparisons incurs high computational cost in complex models, often requiring Markov chain Monte Carlo simulations and potentially thousands of draws, which can be prohibitive without approximations.[7]
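A compact sketch of a posterior predictive check (the data, the discrepancy statistic, and the noninformative prior are all chosen for illustration): replicated datasets are drawn from the posterior predictive of a normal model fitted to heavier-tailed data, and a posterior predictive p-value is computed for a tail-sensitive statistic.

```python
# Sketch of a posterior predictive check (PPC) for a normal model fitted to
# mildly heavy-tailed data, using T(y) = max|y - mean(y)| as the discrepancy.
# Posterior draws use the standard noninformative prior p(mu, sigma^2) ~ 1/sigma^2,
# so no MCMC is needed; everything here is illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
y = stats.t(df=3).rvs(100, random_state=rng)   # "observed" data, heavier-tailed than normal
n = y.size

def T(data):
    return np.max(np.abs(data - data.mean()))

# Under the noninformative prior: sigma^2 | y ~ InvGamma((n-1)/2, (n-1)s^2/2),
# mu | sigma^2, y ~ Normal(ybar, sigma^2/n)
s2 = y.var(ddof=1)
n_rep = 4000
sigma2 = stats.invgamma((n - 1) / 2, scale=(n - 1) * s2 / 2).rvs(n_rep, random_state=rng)
mu = rng.normal(y.mean(), np.sqrt(sigma2 / n))

# Replicated datasets and the posterior predictive p-value
T_rep = np.array([T(rng.normal(mu[i], np.sqrt(sigma2[i]), n)) for i in range(n_rep)])
p_value = np.mean(T_rep >= T(y))
print("posterior predictive p-value:", p_value)
# A value near 0 suggests the normal model under-predicts extreme deviations.
```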
Formulation in Exponential Families
Prior Predictive in Exponential Families
The exponential family provides a unifying framework for many common probability distributions, parameterized in the form

$p(x \mid \eta) = h(x) \exp\left\{ \eta^{\mathsf{T}} T(x) - A(\eta) \right\}$

where $\eta$ is the natural parameter, $T(x)$ is the sufficient statistic, $h(x)$ is the base measure, and $A(\eta)$ is the log-normalizer ensuring integrability. This parameterization facilitates analytical tractability in Bayesian inference, particularly when paired with conjugate priors that preserve the family structure upon updating.[10] The prior predictive distribution for a new observation $\tilde{x}$ under an exponential family likelihood integrates over the prior $p(\eta)$:

$p(\tilde{x}) = \int p(\tilde{x} \mid \eta) \, p(\eta) \, d\eta$

This integral often simplifies to a closed form when using conjugate priors, which are chosen to match the exponential family structure, such as a normal prior for a normal likelihood with known variance. In that normal-normal case the prior predictive is itself normal; a Student's t-distribution arises instead when the variance is unknown and given a conjugate inverse-gamma prior, reflecting the marginalization over the uncertain variance parameter.[1]

A prominent closed-form example arises in the Poisson distribution, an exponential family member with natural parameter $\eta = \log \lambda$ and sufficient statistic $T(x) = x$, where a Gamma($\alpha, \beta$) prior on the rate $\lambda$ yields a negative binomial prior predictive distribution for $\tilde{x}$:

$p(\tilde{x} \mid \alpha, \beta) = \frac{\Gamma(\tilde{x} + \alpha)}{\tilde{x}! \, \Gamma(\alpha)} \left( \frac{\beta}{\beta + 1} \right)^{\alpha} \left( \frac{1}{\beta + 1} \right)^{\tilde{x}}$

This distribution has mean $\alpha/\beta$ and variance $\frac{\alpha}{\beta}\left(1 + \frac{1}{\beta}\right)$, exceeding the Poisson variance due to prior incorporation.[1] The prior predictive in exponential families typically exhibits heavier tails than the conditional likelihood $p(x \mid \eta)$, as the integration over prior uncertainty in $\eta$ introduces additional variability, broadening the predictive support and enhancing robustness to model misspecification.[1]
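A quick numerical check of this Gamma-Poisson prior predictive (hypothetical hyperparameters, SciPy assumed): integrating the Poisson pmf against the gamma prior by quadrature reproduces the negative binomial pmf and its overdispersed variance.

```python
# Sketch: checking the Gamma-Poisson prior predictive numerically. Integrating
# the Poisson pmf against a Gamma(alpha, rate=beta) prior should reproduce the
# negative binomial pmf with mean alpha/beta and variance (alpha/beta)(1 + 1/beta).
# Hyperparameters are hypothetical.
import numpy as np
from scipy import stats
from scipy.integrate import quad

alpha, beta = 3.0, 1.5
prior = stats.gamma(a=alpha, scale=1.0 / beta)      # Gamma with rate beta

def prior_predictive_pmf(k):
    integrand = lambda lam: stats.poisson(lam).pmf(k) * prior.pdf(lam)
    val, _ = quad(integrand, 0.0, np.inf)
    return val

nb = stats.nbinom(alpha, beta / (beta + 1.0))
k = np.arange(6)
print("quadrature:", np.round([prior_predictive_pmf(i) for i in k], 4))
print("neg. binom:", np.round(nb.pmf(k), 4))
print("mean, var :", nb.mean(), nb.var())           # alpha/beta and (alpha/beta)(1 + 1/beta)
```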
Posterior Predictive in Exponential Families
In exponential families, the use of conjugate priors enables closed-form expressions for the posterior predictive distribution, facilitating exact Bayesian inference without relying on approximation methods. The likelihood for observed data $x_{1:n}$ takes the form

$p(x_{1:n} \mid \eta) = \left[ \prod_{i=1}^{n} h(x_i) \right] \exp\left\{ \eta^{\mathsf{T}} \sum_{i=1}^{n} T(x_i) - n A(\eta) \right\}$

where $\eta$ is the natural parameter, $T(x)$ is the sufficient statistic, $A(\eta)$ is the log-partition function, and $h(x)$ is the base measure. A conjugate prior for $\eta$ is given by

$p(\eta \mid \chi, \nu) = \frac{1}{Z(\chi, \nu)} \exp\left\{ \eta^{\mathsf{T}} \chi - \nu A(\eta) \right\}$

where $Z(\chi, \nu)$ is the normalizing constant, $\chi$ encodes prior sufficient statistics, and $\nu$ reflects prior sample size.[10]

Upon observing the data, the posterior updates straightforwardly to $p(\eta \mid \chi', \nu')$, with updated hyperparameters $\chi' = \chi + \sum_{i=1}^{n} T(x_i)$ and $\nu' = \nu + n$. This preservation of the conjugate family form allows the posterior predictive distribution for a new observation $\tilde{x}$ to be derived as

$p(\tilde{x} \mid x_{1:n}) = \int p(\tilde{x} \mid \eta) \, p(\eta \mid \chi', \nu') \, d\eta = h(\tilde{x}) \, \frac{Z\left(\chi' + T(\tilde{x}), \nu' + 1\right)}{Z(\chi', \nu')}$

This expression yields analytically tractable distributions specific to the exponential family member, such as the beta-binomial for binomial likelihoods with beta priors or the Student-t for normal likelihoods with appropriate conjugate priors on mean and variance.[10]

A prominent example is the binomial likelihood with a beta prior. For $n$ independent Bernoulli trials with success probability $\theta$, the likelihood is binomial, and the conjugate Beta($\alpha, \beta$) prior (with natural parameter $\eta = \log\frac{\theta}{1-\theta}$) updates to a posterior Beta($\alpha', \beta'$), with $\alpha' = \alpha + \sum_i x_i$ and $\beta' = \beta + n - \sum_i x_i$. The resulting posterior predictive for a new set of $m$ trials is the beta-binomial distribution,

$p(\tilde{x} \mid x_{1:n}) = \binom{m}{\tilde{x}} \frac{B(\alpha' + \tilde{x}, \, \beta' + m - \tilde{x})}{B(\alpha', \beta')}$

which accounts for both data variability and parameter uncertainty.[10][11]

Another key case involves the normal likelihood with a conjugate prior on the mean and variance. For observations with known variance $\sigma^2$, a normal prior on the mean leads to a normal posterior predictive. However, when incorporating uncertainty in the variance via a normal-inverse-gamma prior (conjugate for the mean and variance jointly), the posterior predictive distribution for a new observation is a Student-t,

$\tilde{x} \mid x_{1:n} \sim t_{2\alpha_n}\!\left(\mu_n, \, \frac{\beta_n (\kappa_n + 1)}{\alpha_n \kappa_n}\right)$

where the degrees of freedom $2\alpha_n$, location $\mu_n$, and scale update based on the data and prior hyperparameters, and the variance of this distribution is $\frac{\beta_n (\kappa_n + 1)}{(\alpha_n - 1)\,\kappa_n}$ for $\alpha_n > 1$. This heavier-tailed predictive reflects epistemic uncertainty in both parameters.[12][13]

The primary advantages of this framework lie in its computational tractability: the integrals for the marginal likelihood and predictive are exact and avoid simulation-based methods like MCMC, enabling efficient inference even for moderate datasets. This exactness is particularly valuable in hierarchical models or when multiple predictions are needed, as it provides closed-form uncertainty quantification without approximation error.[10]
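As a hedged sketch of the Student-t case (using the standard normal-inverse-gamma update formulas; the hyperparameters and data are invented), the closed-form posterior predictive is compared against Monte Carlo draws obtained by sampling the variance, then the mean, then a new observation.

```python
# Sketch: posterior predictive under the conjugate normal-inverse-gamma prior
#   mu | sigma^2 ~ Normal(mu0, sigma^2 / kappa0),  sigma^2 ~ InvGamma(a0, b0).
# With the standard update formulas, the predictive for a new observation is
# Student t with df = 2*a_n, loc = mu_n, scale^2 = b_n*(kappa_n + 1)/(a_n*kappa_n).
# Hyperparameters and data below are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
mu0, kappa0, a0, b0 = 0.0, 1.0, 2.0, 2.0
x = rng.normal(1.0, 1.5, 30)
n, xbar = x.size, x.mean()

kappa_n = kappa0 + n
mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
a_n = a0 + n / 2.0
b_n = b0 + 0.5 * np.sum((x - xbar) ** 2) + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n)

scale = np.sqrt(b_n * (kappa_n + 1.0) / (a_n * kappa_n))
post_pred = stats.t(df=2.0 * a_n, loc=mu_n, scale=scale)

# Monte Carlo check: sigma^2 ~ InvGamma(a_n, b_n), mu ~ Normal(mu_n, sigma^2/kappa_n),
# x_new ~ Normal(mu, sigma^2)
sigma2 = stats.invgamma(a_n, scale=b_n).rvs(200_000, random_state=rng)
mu = rng.normal(mu_n, np.sqrt(sigma2 / kappa_n))
x_new = rng.normal(mu, np.sqrt(sigma2))

qs = [0.05, 0.5, 0.95]
print("Student t  :", np.round(post_pred.ppf(qs), 3))
print("Monte Carlo:", np.round(np.quantile(x_new, qs), 3))
```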
Joint Predictive Distribution and Marginal Likelihood
In Bayesian inference within exponential families equipped with conjugate priors, the joint predictive distribution for observed data $y$ and future data $\tilde{y}$ is given by

$p(y, \tilde{y}) = \int p(y \mid \theta) \, p(\tilde{y} \mid \theta) \, p(\theta) \, d\theta$

where $p(\theta)$ is the prior density on the parameter $\theta$. This integral represents the marginal probability of the combined dataset $(y, \tilde{y})$.[10]

For likelihoods in the exponential family form $p(x \mid \eta) = h(x) \exp\{\eta^{\mathsf{T}} T(x) - A(\eta)\}$, where $T(x)$ is the sufficient statistic, $\eta$ the natural parameter, $h(x)$ the base measure, and $A(\eta)$ the log-normalizer, the conjugate prior takes the form $p(\eta \mid \chi, \nu) = \exp\{\eta^{\mathsf{T}} \chi - \nu A(\eta)\} / Z(\chi, \nu)$, with hyperparameters $\chi$ and $\nu$, and $Z(\chi, \nu)$ the normalizing constant. In this setup, the joint predictive admits a closed-form expression via updated sufficient statistics:

$p(y, \tilde{y} \mid \chi, \nu) = \left[ \prod_{i=1}^{N} h(y_i) \right] h(\tilde{y}) \, \frac{Z\left(\chi + T(y) + T(\tilde{y}), \, \nu + N + 1\right)}{Z(\chi, \nu)}$

where $T(y) = \sum_{i=1}^{N} T(y_i)$ and $T(\tilde{y})$ is the sufficient statistic for the future data. This form arises because the joint treats $(y, \tilde{y})$ as a single augmented sample from the exponential family, updating the prior hyperparameters accordingly.[10]

The marginal likelihood, or evidence, for the observed data is

$p(y \mid \chi, \nu) = \left[ \prod_{i=1}^{N} h(y_i) \right] \frac{Z\left(\chi + T(y), \, \nu + N\right)}{Z(\chi, \nu)}$

which is computed as the ratio of normalizing constants from the posterior to the prior, reflecting the change in sufficient statistics induced by the data.[10]

The posterior predictive distribution relates directly to these quantities as $p(\tilde{y} \mid y) = p(y, \tilde{y}) / p(y)$, which simplifies to

$p(\tilde{y} \mid y, \chi, \nu) = h(\tilde{y}) \, \frac{Z\left(\chi + T(y) + T(\tilde{y}), \, \nu + N + 1\right)}{Z\left(\chi + T(y), \, \nu + N\right)}$

in the conjugate exponential family case, facilitating evidence-based updates in inference.[10]

A concrete example occurs in the Gamma-Poisson model, where the likelihood is Poisson with rate $\lambda$ and the prior is Gamma($\alpha, \beta$), a conjugate pair for the exponential family representation of the Poisson. Here, marginalizing the Poisson counts over the Gamma prior yields a negative binomial distribution: the combined total $s + \tilde{y}$ of the observed counts (sum $s = \sum_i y_i$) and a future count $\tilde{y}$ is negative binomial with shape $\alpha$ and a scale adjusted by the total exposure $N + 1$, demonstrating how the joint form extends the marginal to augmented data.
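A short sketch of these identities in the Gamma-Poisson model (hypothetical counts and hyperparameters): the marginal likelihood is a ratio of gamma normalizers, and dividing the joint by the marginal recovers the negative binomial posterior predictive.

```python
# Sketch: the Gamma-Poisson case of the conjugate exponential-family identities.
# With a Gamma(alpha, rate beta) prior, the gamma normalizer is Z(a, b) = Gamma(a)/b^a,
# the marginal likelihood is prod(1/y_i!) * Z(alpha + s, beta + N) / Z(alpha, beta)
# with s = sum(y), and the posterior predictive p(y_new | y) = p(y, y_new) / p(y)
# equals NegBin(alpha + s, (beta + N)/(beta + N + 1)). Numbers are hypothetical.
import numpy as np
from scipy import stats
from scipy.special import gammaln

rng = np.random.default_rng(9)
alpha, beta = 2.0, 1.0
y = rng.poisson(3.0, size=12)
s, N = y.sum(), y.size

def log_Z(a, b):                      # log normalizer of a Gamma(a, rate b) density
    return gammaln(a) - a * np.log(b)

def log_marginal(counts):
    s_, n_ = np.sum(counts), len(counts)
    return (-np.sum(gammaln(np.asarray(counts) + 1.0))
            + log_Z(alpha + s_, beta + n_) - log_Z(alpha, beta))

# Posterior predictive as a ratio of joint to marginal, vs. the closed form
y_new = 4
log_ratio = log_marginal(np.append(y, y_new)) - log_marginal(y)
closed_form = stats.nbinom(alpha + s, (beta + N) / (beta + N + 1.0)).logpmf(y_new)
print("joint / marginal :", np.exp(log_ratio))
print("negative binomial:", np.exp(closed_form))
```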
