Prior probability
A prior probability distribution of an uncertain quantity, simply called the prior, is its assumed probability distribution before some evidence is taken into account. For example, the prior could be the probability distribution representing the relative proportions of voters who will vote for a particular politician in a future election. The unknown quantity may be a parameter of the model or a latent variable rather than an observable variable.
In Bayesian statistics, Bayes' rule prescribes how to update the prior with new information to obtain the posterior probability distribution, which is the conditional distribution of the uncertain quantity given new data. Historically, the choice of priors was often constrained to a conjugate family of a given likelihood function, so that it would result in a tractable posterior of the same family. The widespread availability of Markov chain Monte Carlo methods, however, has made this less of a concern.
There are many ways to construct a prior distribution.[1] In some cases, a prior may be determined from past information, such as previous experiments. A prior can also be elicited from the purely subjective assessment of an experienced expert.[2][3][4] When no information is available, an uninformative prior may be adopted as justified by the principle of indifference.[5][6] In modern applications, priors are also often chosen for their mechanical properties, such as regularization and feature selection.[7][8][9]
The prior distributions of model parameters will often depend on parameters of their own. Uncertainty about these hyperparameters can, in turn, be expressed as hyperprior probability distributions. For example, if one uses a beta distribution to model the distribution of the parameter p of a Bernoulli distribution, then:
- p is a parameter of the underlying system (Bernoulli distribution), and
- α and β are parameters of the prior distribution (beta distribution); hence hyperparameters.
In principle, priors can be decomposed into many conditional levels of distributions, so-called hierarchical priors.[10]
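The beta/Bernoulli setup above can be sketched in a few lines. This is a minimal illustration, assuming the standard conjugate update (a Beta(α, β) prior combined with s successes and f failures gives a Beta(α + s, β + f) posterior); the function names and the data are hypothetical.

```python
from fractions import Fraction


def beta_bernoulli_update(alpha, beta, observations):
    """Return the Beta hyperparameters after observing a list of 0/1 outcomes."""
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution, kept exact with Fraction."""
    return Fraction(alpha, alpha + beta)


# Hyperparameters of the prior on p (illustrative values).
prior = (2, 2)
# Hypothetical Bernoulli data: 6 successes, 2 failures.
data = [1, 0, 1, 1, 0, 1, 1, 1]

posterior = beta_bernoulli_update(*prior, data)
print("posterior hyperparameters:", posterior)
print("posterior mean of p:", beta_mean(*posterior))
```

Here p is the parameter of the system, while the pair (α, β) only describes the belief about p, which is why the data change α and β rather than being modelled by them.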
Informative priors
An informative prior expresses specific, definite information about a variable. An example is a prior distribution for the temperature at noon tomorrow. A reasonable approach is to make the prior a normal distribution with expected value equal to today's noontime temperature, with variance equal to the day-to-day variance of atmospheric temperature, or a distribution of the temperature for that day of the year.
This example has a property in common with many priors: the posterior from one problem (today's temperature) becomes the prior for another problem (tomorrow's temperature). Pre-existing evidence that has already been taken into account is part of the prior. As more evidence accumulates, the posterior is determined largely by the evidence rather than by any original assumption, provided that the original assumption admitted the possibility of what the evidence is suggesting. The terms "prior" and "posterior" are generally relative to a specific datum or observation.
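The idea that one problem's posterior serves as the next problem's prior can be checked numerically. The sketch below assumes a normal prior and normally distributed observations with known variance (a simplification not stated in the text); all numbers are made up for illustration.

```python
import math


def normal_update(prior_mean, prior_var, obs, obs_var):
    """Posterior (mean, variance) for a normal prior and one normal
    observation with known variance (precision-weighted average)."""
    precision = 1.0 / prior_var + 1.0 / obs_var
    mean = (prior_mean / prior_var + obs / obs_var) / precision
    return mean, 1.0 / precision


prior = (50.0, 100.0)   # hypothetical prior mean and variance
obs_var = 25.0          # assumed known observation variance

# Sequential use: the posterior after the first reading is the prior for the second.
step1 = normal_update(*prior, 58.0, obs_var)
step2 = normal_update(*step1, 61.0, obs_var)

# Batch use: both readings at once, summarised by their average,
# whose variance is obs_var / 2.
batch = normal_update(*prior, (58.0 + 61.0) / 2, obs_var / 2)

print("sequential:", step2)
print("batch:     ", batch)
```

Both routes give the same posterior, so which observations count as "prior" and which as "data" is only a matter of bookkeeping.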
Strong prior
A strong prior is a preceding assumption, theory, concept or idea upon which, after taking account of new information, a current assumption, theory, concept or idea is founded.[citation needed] A strong prior is a type of informative prior in which the information contained in the prior distribution dominates the information contained in the data being analyzed. The Bayesian analysis combines the information contained in the prior with that extracted from the data to produce the posterior distribution which, in the case of a "strong prior", would be little changed from the prior distribution.
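A small numerical sketch may make "dominates the data" concrete. It assumes a Bernoulli likelihood with Beta priors, which is only one convenient choice; the prior strengths and the data are hypothetical.

```python
def posterior_mean(alpha, beta, successes, failures):
    """Posterior mean of a Bernoulli parameter under a Beta(alpha, beta) prior."""
    return (alpha + successes) / (alpha + beta + successes + failures)


# Hypothetical data: 9 successes in 10 trials.
successes, failures = 9, 1

# Strong prior: worth 1000 pseudo-observations centred on 0.5.
strong = posterior_mean(500, 500, successes, failures)
# Weak prior: worth 2 pseudo-observations centred on 0.5.
weak = posterior_mean(1, 1, successes, failures)

print("strong prior -> posterior mean:", round(strong, 4))
print("weak prior   -> posterior mean:", round(weak, 4))
```

Under the strong prior the posterior mean stays close to the prior mean of 0.5, while under the weak prior it moves most of the way to the observed frequency.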
Weakly informative priors
A weakly informative prior expresses partial information about a variable, steering the analysis toward solutions that align with existing knowledge without overly constraining the results, while still preventing extreme estimates. An example is, when setting the prior distribution for the temperature at noon tomorrow in St. Louis, to use a normal distribution with mean 50 degrees Fahrenheit and standard deviation 40 degrees, which very loosely constrains the temperature to the range (10 degrees, 90 degrees) with a small chance of being below −30 degrees or above 130 degrees. The purpose of a weakly informative prior is regularization, that is, to keep inferences in a reasonable range.
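The ranges quoted for the St. Louis example correspond to one and two standard deviations around the mean. A quick check, using only the standard normal CDF written with `math.erf` (the helper name is an illustrative choice):

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal(mean, sd) variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


mean, sd = 50.0, 40.0  # values from the example in the text

# Prior probability of the loosely constrained range (10, 90).
inside = normal_cdf(90.0, mean, sd) - normal_cdf(10.0, mean, sd)
# Prior probability of an extreme temperature: below -30 or above 130.
extreme = normal_cdf(-30.0, mean, sd) + (1.0 - normal_cdf(130.0, mean, sd))

print("P(10 < T < 90)         =", round(inside, 4))
print("P(T < -30 or T > 130)  =", round(extreme, 4))
```

Roughly two thirds of the prior mass lies in the central range and under 5% in the extreme tails, which is what "very loosely constrains" amounts to here.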
Uninformative priors
An uninformative, flat, or diffuse prior expresses vague or general information about a variable.[5] The term "uninformative prior" is somewhat of a misnomer. Such a prior might also be called a not very informative prior, or an objective prior, i.e., one that is not subjectively elicited.
Uninformative priors can express "objective" information such as "the variable is positive" or "the variable is less than some limit". The simplest and oldest rule for determining a non-informative prior is the principle of indifference, which assigns equal probabilities to all possibilities. In parameter estimation problems, the use of an uninformative prior typically yields results which are not too different from conventional statistical analysis, as the likelihood function often yields more information than the uninformative prior.
Some attempts have been made at finding a priori probabilities, i.e., probability distributions in some sense logically required by the nature of one's state of uncertainty; these are a subject of philosophical controversy, with Bayesians being roughly divided into two schools: "objective Bayesians", who believe such priors exist in many useful situations, and "subjective Bayesians" who believe that in practice priors usually represent subjective judgements of opinion that cannot be rigorously justified (Williamson 2010). Perhaps the strongest arguments for objective Bayesianism were given by Edwin T. Jaynes, based mainly on the consequences of symmetries and on the principle of maximum entropy.
As an example of an a priori prior, due to Jaynes (2003), consider a situation in which one knows a ball has been hidden under one of three cups, A, B, or C, but no other information is available about its location. In this case a uniform prior of p(A) = p(B) = p(C) = 1/3 seems intuitively like the only reasonable choice. More formally, we can see that the problem remains the same if we swap around the labels ("A", "B" and "C") of the cups. It would therefore be odd to choose a prior for which a permutation of the labels would cause a change in our predictions about which cup the ball will be found under; the uniform prior is the only one which preserves this invariance. If one accepts this invariance principle then one can see that the uniform prior is the logically correct prior to represent this state of knowledge. This prior is "objective" in the sense of being the correct choice to represent a particular state of knowledge, but it is not objective in the sense of being an observer-independent feature of the world: in reality the ball exists under a particular cup, and it only makes sense to speak of probabilities in this situation if there is an observer with limited knowledge about the system.[11]
As a more contentious example, Jaynes published an argument based on the invariance of the prior under a change of parameters that suggests that the prior representing complete uncertainty about a probability should be the Haldane prior $p^{-1}(1-p)^{-1}$.[12] The example Jaynes gives is of finding a chemical in a lab and asking whether it will dissolve in water in repeated experiments. The Haldane prior[13] gives by far the most weight to $p=0$ and $p=1$, indicating that the sample will either dissolve every time or never dissolve, with equal probability. However, if one has observed samples of the chemical to dissolve in one experiment and not to dissolve in another experiment then this prior is updated to the uniform distribution on the interval [0, 1]. This is obtained by applying Bayes' theorem to the data set consisting of one observation of dissolving and one of not dissolving, using the above prior. The Haldane prior is an improper prior distribution (meaning that it has an infinite mass). Harold Jeffreys devised a systematic way of designing uninformative priors, e.g., the Jeffreys prior $p^{-1/2}(1-p)^{-1/2}$ for the Bernoulli random variable.
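The update to the uniform distribution can be written out in one line. As a sketch, take the likelihood of one dissolving and one non-dissolving outcome to be $p(1-p)$ (assuming independent trials with the same $p$) and multiply it by the Haldane prior:

```latex
\[
  \underbrace{p^{-1}(1-p)^{-1}}_{\text{Haldane prior}}
  \times
  \underbrace{p\,(1-p)}_{\text{one success, one failure}}
  \;=\; 1 , \qquad 0 < p < 1 .
\]
```

The result is constant in $p$ and integrates to 1 over $[0, 1]$, so it is already the normalized uniform density.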
Priors can be constructed which are proportional to the Haar measure if the parameter space X carries a natural group structure which leaves invariant our Bayesian state of knowledge.[12] This can be seen as a generalisation of the invariance principle used to justify the uniform prior over the three cups in the example above. For example, in physics we might expect that an experiment will give the same results regardless of our choice of the origin of a coordinate system. This induces the group structure of the translation group on X, which determines the prior probability as a constant improper prior. Similarly, some measurements are naturally invariant to the choice of an arbitrary scale (e.g., whether centimeters or inches are used, the physical results should be equal). In such a case, the scale group is the natural group structure, and the corresponding prior on X is proportional to 1/x. It sometimes matters whether we use the left-invariant or right-invariant Haar measure. For example, the left and right invariant Haar measures on the affine group are not equal. Berger (1985, p. 413) argues that the right-invariant Haar measure is the correct choice.
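The scale-invariance property of the 1/x prior can be illustrated numerically. The sketch below compares the (unnormalized) mass that the 1/x prior and a flat prior assign to the same physical interval expressed in centimeters and in inches; the interval and the function names are arbitrary choices for illustration.

```python
import math


def scale_prior_mass(a, b):
    """Unnormalized mass of the prior 1/x on [a, b]: the integral of dx/x."""
    return math.log(b / a)


def flat_prior_mass(a, b):
    """Unnormalized mass of a constant prior on [a, b]."""
    return b - a


CM_PER_INCH = 2.54
lo_cm, hi_cm = 1.0, 10.0                                # an interval in centimeters
lo_in, hi_in = lo_cm / CM_PER_INCH, hi_cm / CM_PER_INCH  # the same interval in inches

print("1/x prior:  ", scale_prior_mass(lo_cm, hi_cm), scale_prior_mass(lo_in, hi_in))
print("flat prior: ", flat_prior_mass(lo_cm, hi_cm), flat_prior_mass(lo_in, hi_in))
```

The 1/x prior gives the same mass in both unit systems, whereas the flat prior's mass changes with the unit, which is the sense in which 1/x respects the scale group.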
Another idea, championed by Edwin T. Jaynes, is to use the principle of maximum entropy (MAXENT). The motivation is that the Shannon entropy of a probability distribution measures the amount of information contained in the distribution. The larger the entropy, the less information is provided by the distribution. Thus, by maximizing the entropy over a suitable set of probability distributions on X, one finds the distribution that is least informative in the sense that it contains the least amount of information consistent with the constraints that define the set. For example, the maximum entropy prior on a discrete space, given only that the probability is normalized to 1, is the prior that assigns equal probability to each state. And in the continuous case, the maximum entropy prior given that the density is normalized with mean zero and unit variance is the standard normal distribution. The principle of minimum cross-entropy generalizes MAXENT to the case of "updating" an arbitrary prior distribution with suitable constraints in the maximum-entropy sense.
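For the discrete case, the claim that the equal-probability assignment maximizes entropy is easy to check on small examples. A minimal sketch, with three hypothetical distributions over four states:

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats; terms with zero probability contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
certain = [1.0, 0.0, 0.0, 0.0]

for name, dist in [("uniform", uniform), ("skewed", skewed), ("certain", certain)]:
    print(name, round(shannon_entropy(dist), 4))
```

The uniform distribution attains the maximum value log 4, and the entropy drops toward zero as the distribution concentrates on a single state.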
A related idea, reference priors, was introduced by José-Miguel Bernardo. Here, the idea is to maximize the expected Kullback–Leibler divergence of the posterior distribution relative to the prior. This maximizes the expected posterior information about $X$ when the prior density is $p(x)$; thus, in some sense, $p(x)$ is the "least informative" prior about $X$. The reference prior is defined in the asymptotic limit, i.e., one considers the limit of the priors so obtained as the number of data points goes to infinity. In the present case, the KL divergence between the prior and posterior distributions is given by
$$KL = \int p(t) \int p(x \mid t) \log \frac{p(x \mid t)}{p(x)} \, dx \, dt.$$
Here, $t$ is a sufficient statistic for some parameter $x$. The inner integral is the KL divergence between the posterior $p(x \mid t)$ and prior $p(x)$ distributions and the result is the weighted mean over all values of $t$. Splitting the logarithm into two parts, reversing the order of integrals in the second part and noting that $\log[p(x)]$ does not depend on $t$ yields
$$KL = \int p(t) \int p(x \mid t) \log[p(x \mid t)] \, dx \, dt - \int \log[p(x)] \int p(t) \, p(x \mid t) \, dt \, dx.$$
The inner integral in the second part is the integral over $t$ of the joint density $p(x, t)$. This is the marginal distribution $p(x)$, so we have
$$KL = \int p(t) \int p(x \mid t) \log[p(x \mid t)] \, dx \, dt - \int p(x) \log[p(x)] \, dx.$$
Now we use the concept of entropy which, in the case of probability distributions, is the negative expected value of the logarithm of the probability mass or density function, or $H(x) = -\int p(x) \log[p(x)] \, dx$. Using this in the last equation yields
$$KL = -\int p(t) \, H(x \mid t) \, dt + H(x).$$
In words, KL is the negative expected value over $t$ of the entropy of $x$ conditional on $t$ plus the marginal (i.e., unconditional) entropy of $x$. In the limiting case where the sample size tends to infinity, the Bernstein–von Mises theorem states that the distribution of $x$ conditional on a given observed value of $t$ is normal with a variance equal to the reciprocal of the Fisher information at the 'true' value of $x$. The entropy of a normal density function is equal to half the logarithm of $2\pi e v$, where $v$ is the variance of the distribution. In this case therefore
$$H = \log \sqrt{\frac{2\pi e}{N I(x^*)}},$$
where $N$ is the arbitrarily large sample size (to which Fisher information is proportional) and $x^*$ is the 'true' value. Since this does not depend on $t$ it can be taken out of the integral, and as this integral is over a probability space it equals one. Hence we can write the asymptotic form of KL as
$$KL = -\log\left(\frac{1}{\sqrt{k I(x^*)}}\right) - \int p(x) \log[p(x)] \, dx,$$
where $k$ is proportional to the (asymptotically large) sample size. We do not know the value of $x^*$. Indeed, the very idea goes against the philosophy of Bayesian inference in which 'true' values of parameters are replaced by prior and posterior distributions. So we remove $x^*$ by replacing it with $x$ and taking the expected value of the normal entropy, which we obtain by multiplying by $p(x)$ and integrating over $x$. This allows us to combine the logarithms, yielding
$$KL = -\int p(x) \log\left[\frac{p(x)}{\sqrt{k I(x)}}\right] dx.$$
This is a quasi-KL divergence ("quasi" in the sense that the square root of the Fisher information may be the kernel of an improper distribution). Due to the minus sign, we need to minimise this in order to maximise the KL divergence with which we started. The minimum value of the last equation occurs where the two distributions in the logarithm argument, improper or not, do not diverge. This in turn occurs when the prior distribution is proportional to the square root of the Fisher information of the likelihood function. Hence in the single parameter case, reference priors and Jeffreys priors are identical, even though Jeffreys has a very different rationale.
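For a single Bernoulli observation, the statement that the prior is proportional to the square root of the Fisher information can be checked directly. The sketch below computes the Fisher information as the expected squared score and compares its square root with the closed form of the Jeffreys prior, $p^{-1/2}(1-p)^{-1/2}$; the function names are illustrative.

```python
import math


def fisher_information_bernoulli(p):
    """Expected squared score of one Bernoulli(p) observation."""
    score_success = 1.0 / p             # d/dp of log p
    score_failure = -1.0 / (1.0 - p)    # d/dp of log (1 - p)
    return p * score_success ** 2 + (1.0 - p) * score_failure ** 2


def jeffreys_unnormalized(p):
    """Square root of the Fisher information (unnormalized prior density)."""
    return math.sqrt(fisher_information_bernoulli(p))


for p in (0.1, 0.2, 0.5, 0.9):
    closed_form = p ** -0.5 * (1.0 - p) ** -0.5
    print(p, round(jeffreys_unnormalized(p), 6), round(closed_form, 6))
```

The two columns agree because the Fisher information simplifies to $1/(p(1-p))$.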
Reference priors are often the objective prior of choice in multivariate problems, since other rules (e.g., Jeffreys' rule) may result in priors with problematic behavior.
Objective prior distributions may also be derived from other principles, such as information or coding theory (see e.g., minimum description length) or frequentist statistics (so-called probability matching priors).[14] Such methods are used in Solomonoff's theory of inductive inference. The construction of objective priors has also recently been introduced in bioinformatics, especially for inference in cancer systems biology, where sample size is limited and a vast amount of prior knowledge is available. These methods use an information-theoretic criterion, such as the KL divergence or the log-likelihood function, for binary supervised learning problems[15] and mixture model problems.[16]
Philosophical problems with uninformative priors concern the choice of an appropriate metric, or measurement scale. Suppose we want a prior for the running speed of a runner who is unknown to us. We could specify, say, a normal distribution as the prior for his speed, but alternatively we could specify a normal prior for the time he takes to complete 100 metres, which is proportional to the reciprocal of his speed. These are very different priors, but it is not clear which is to be preferred. Jaynes' method of transformation groups can answer this question in some situations.[17]
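The dependence on the measurement scale can be seen with a small calculation. For simplicity the sketch below swaps the normal priors of the text for uniform ones (an assumption made only to keep the arithmetic exact) and asks for the prior probability that a 100 m run takes between 10 and 15 seconds; the speed and time ranges are hypothetical.

```python
def prob_time_uniform_speed(t_lo, t_hi, v_min, v_max, distance=100.0):
    """P(t_lo <= time <= t_hi) when speed is uniform on [v_min, v_max].
    The time lies in [t_lo, t_hi] exactly when the speed lies in
    [distance / t_hi, distance / t_lo]."""
    lo = max(v_min, distance / t_hi)
    hi = min(v_max, distance / t_lo)
    return max(0.0, hi - lo) / (v_max - v_min)


def prob_time_uniform_time(t_lo, t_hi, t_min, t_max):
    """P(t_lo <= time <= t_hi) when time is uniform on [t_min, t_max]."""
    lo = max(t_min, t_lo)
    hi = min(t_max, t_hi)
    return max(0.0, hi - lo) / (t_max - t_min)


# Speeds of 5 to 10 m/s correspond to 100 m times of 10 to 20 s.
by_speed = prob_time_uniform_speed(10.0, 15.0, v_min=5.0, v_max=10.0)
by_time = prob_time_uniform_time(10.0, 15.0, t_min=10.0, t_max=20.0)

print("uniform in speed:", round(by_speed, 4))
print("uniform in time: ", round(by_time, 4))
```

Both priors cover exactly the same set of runners, yet they assign different probabilities to the same event, so "uniform" is not a scale-free notion of ignorance.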
Similarly, if asked to estimate an unknown proportion between 0 and 1, we might say that all proportions are equally likely, and use a uniform prior. Alternatively, we might say that all orders of magnitude for the proportion are equally likely, the logarithmic prior, which is the uniform prior on the logarithm of proportion. The Jeffreys prior attempts to solve this problem by computing a prior which expresses the same belief no matter which metric is used. The Jeffreys prior for an unknown proportion $p$ is $p^{-1/2}(1-p)^{-1/2}$, which differs from Jaynes' recommendation.
Priors based on notions of algorithmic probability are used in inductive inference as a basis for induction in very general settings.
Practical problems associated with uninformative priors include the requirement that the posterior distribution be proper. The usual uninformative priors on continuous, unbounded variables are improper. This need not be a problem if the posterior distribution is proper. Another issue of importance is that if an uninformative prior is to be used routinely, i.e., with many different data sets, it should have good frequentist properties. Normally a Bayesian would not be concerned with such issues, but it can be important in this situation. For example, one would want any decision rule based on the posterior distribution to be admissible under the adopted loss function. Unfortunately, admissibility is often difficult to check, although some results are known (e.g., Berger and Strawderman 1996). The issue is particularly acute with hierarchical Bayes models; the usual priors (e.g., Jeffreys' prior) may give badly inadmissible decision rules if employed at the higher levels of the hierarchy.
Improper priors
Let events $A_1, A_2, \ldots, A_n$ be mutually exclusive and exhaustive. If Bayes' theorem is written as
$$P(A_i \mid B) = \frac{P(B \mid A_i) \, P(A_i)}{\sum_j P(B \mid A_j) \, P(A_j)},$$
then it is clear that the same result would be obtained if all the prior probabilities $P(A_i)$ and $P(A_j)$ were multiplied by a given constant; the same would be true for a continuous random variable. If the summation in the denominator converges, the posterior probabilities will still sum (or integrate) to 1 even if the prior values do not, and so the priors may only need to be specified in the correct proportion. Taking this idea further, in many cases the sum or integral of the prior values may not even need to be finite to get sensible answers for the posterior probabilities. When this is the case, the prior is called an improper prior. However, the posterior distribution need not be a proper distribution if the prior is improper.[18] This is clear from the case where event $B$ is independent of all of the $A_j$.
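The invariance of the posterior under rescaling of the prior can be checked on a toy example with three events. The prior weights and likelihood values below are made up for illustration.

```python
import math


def posterior(prior_weights, likelihoods):
    """Bayes' theorem for mutually exclusive, exhaustive events.
    prior_weights need only be proportional to the prior probabilities."""
    joint = [w * l for w, l in zip(prior_weights, likelihoods)]
    total = sum(joint)  # the denominator: sum over j of P(B | A_j) * weight_j
    return [j / total for j in joint]


prior = [0.5, 0.3, 0.2]        # hypothetical P(A_i)
likelihood = [0.1, 0.4, 0.8]   # hypothetical P(B | A_i)

normalized = posterior(prior, likelihood)
rescaled = posterior([10.0 * w for w in prior], likelihood)  # weights sum to 10

print(normalized)
print(rescaled)
```

The constant cancels between numerator and denominator, so only the relative sizes of the prior weights matter.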
Statisticians sometimes use improper priors as uninformative priors.[19] For example, if they need a prior distribution for the mean and variance of a random variable, they may assume p(m, v) ~ 1/v (for v > 0) which would suggest that any value for the mean is "equally likely" and that a value for the positive variance becomes "less likely" in inverse proportion to its value. Many authors (Lindley, 1973; De Groot, 1937; Kass and Wasserman, 1996)[citation needed] warn against the danger of over-interpreting those priors since they are not probability densities. The only relevance they have is found in the corresponding posterior, as long as it is well-defined for all observations. (The Haldane prior is a typical counterexample: its posterior is improper when all observations are successes or all are failures.[citation needed])
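Why the Haldane prior fails to give a well-defined posterior for every data set can be sketched as follows. Under the prior $p^{-1}(1-p)^{-1}$, s successes and f failures give a posterior proportional to $p^{s-1}(1-p)^{f-1}$, which is a Beta(s, f) density only when both counts are positive. The helper below is a hypothetical illustration of that condition, not a library function.

```python
def haldane_posterior(successes, failures):
    """Return the Beta parameters of the posterior under the Haldane prior,
    or None when the posterior cannot be normalized (improper)."""
    if successes > 0 and failures > 0:
        return (successes, failures)
    # With no successes (or no failures) the density behaves like 1/p near 0
    # (or 1/(1-p) near 1), so its integral diverges.
    return None


print(haldane_posterior(1, 1))   # one success, one failure
print(haldane_posterior(3, 0))   # three successes, no failures
```

So the posterior exists for mixed outcomes but not when every trial has the same result, which is the situation the parenthetical remark refers to.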
By contrast, likelihood functions do not need to be integrated, and a likelihood function that is uniformly 1 corresponds to the absence of data (all models are equally likely, given no data): Bayes' rule multiplies a prior by the likelihood, and an empty product is just the constant likelihood 1. However, without starting with a prior probability distribution, one does not end up getting a posterior probability distribution, and thus cannot integrate or compute expected values or loss. See Likelihood function § Non-integrability for details.
Examples
Examples of improper priors include:
- The uniform distribution on an infinite interval (i.e., a half-line or the entire real line).
- Beta(0,0), the beta distribution for α=0, β=0 (uniform distribution on log-odds scale).
- The logarithmic prior on the positive reals (uniform distribution on log scale).[citation needed]
These functions, interpreted as uniform distributions, can also be interpreted as the likelihood function in the absence of data, but are not proper priors.
Prior probability in statistical mechanics
While in Bayesian statistics the prior probability is used to represent initial beliefs about an uncertain parameter, in statistical mechanics the a priori probability is used to describe the initial state of a system.[20] The classical version is defined as the ratio of the number of elementary events (e.g., the number of times a die is thrown) to the total number of events, with these considered purely deductively, i.e., without any experimenting. In the case of the die, if we look at it on the table without throwing it, each elementary event is reasoned deductively to have the same probability; thus the probability of each outcome of an imaginary throw of the (perfect) die, obtained simply by counting the number of faces, is 1/6. Each face of the die appears with equal probability, probability being a measure defined for each elementary event. The result is different if we throw the die twenty times and ask how many times (out of 20) the number 6 appears on the upper face. In this case time comes into play and we have a different type of probability, depending on time or on the number of times the die is thrown. On the other hand, the a priori probability is independent of time: you can look at the die on the table as long as you like without touching it, and you deduce that the probability for the number 6 to appear on the upper face is 1/6.
In statistical mechanics, e.g., that of a gas contained in a finite volume $V$, both the spatial coordinates $q_i$ and the momentum coordinates $p_i$ of the individual gas elements (atoms or molecules) are finite in the phase space spanned by these coordinates. In analogy to the case of the die, the a priori probability is here (in the case of a continuum) proportional to the phase space volume element $\Delta q \Delta p$ divided by $h$, and is the number of standing waves (i.e., states) therein, where $\Delta q$ is the range of the variable $q$ and $\Delta p$ is the range of the variable $p$ (here for simplicity considered in one dimension). In 1 dimension (length $L$) this number or statistical weight or a priori weighting is $L \Delta p / h$. In customary 3 dimensions (volume $V$) the corresponding number can be calculated to be $V 4\pi p^2 \Delta p / h^3$.[21] In order to understand this quantity as giving a number of states in quantum (i.e., wave) mechanics, recall that in quantum mechanics every particle is associated with a matter wave which is the solution of a Schrödinger equation. In the case of free particles (of energy $\epsilon = \mathbf{p}^2 / 2m$) like those of a gas in a box of volume $V = L^3$ such a matter wave is explicitly
$$\psi \propto \sin(l \pi x / L) \sin(m \pi y / L) \sin(n \pi z / L),$$
where $l, m, n$ are integers. The number of different $(l, m, n)$ values and hence states in the region between $p$ and $p + dp$, with $p^2 = \mathbf{p}^2$, is then found to be the above expression $V 4\pi p^2 \, dp / h^3$ by considering the area covered by these points. Moreover, in view of the uncertainty relation, which in 1 spatial dimension is
$$\Delta q \Delta p \geq h,$$
these states are indistinguishable (i.e., these states do not carry labels). An important consequence is a result known as Liouville's theorem, i.e., the time independence of this phase space volume element and thus of the a priori probability. A time dependence of this quantity would imply known information about the dynamics of the system, and hence would not be an a priori probability.[22] Thus the region
$$\Omega := \frac{\Delta q \Delta p}{\int \Delta q \Delta p}, \quad \int \Delta q \Delta p = \mathrm{const.},$$
when differentiated with respect to time $t$ yields zero (with the help of Hamilton's equations): the volume at time $t$ is the same as at time zero. One describes this also as conservation of information.
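The counting argument for the particle in a box can be tried out by brute force. The sketch below assumes the standard standing-wave relation $p = (h/2L)\sqrt{l^2 + m^2 + n^2}$ with positive integers $l, m, n$ (not spelled out in the text), under which the expression $V 4\pi p^2 \, dp / h^3$ becomes one octant of a spherical shell in $(l, m, n)$ space. The shell radii are arbitrary.

```python
import math


def count_states_in_shell(k_lo, k_hi):
    """Count triples of positive integers (l, m, n) with
    k_lo <= sqrt(l^2 + m^2 + n^2) < k_hi."""
    limit = int(math.ceil(k_hi))
    lo2, hi2 = k_lo ** 2, k_hi ** 2
    count = 0
    for l in range(1, limit + 1):
        for m in range(1, limit + 1):
            for n in range(1, limit + 1):
                if lo2 <= l * l + m * m + n * n < hi2:
                    count += 1
    return count


def shell_estimate(k_lo, k_hi):
    """Continuum estimate: volume of one octant of a spherical shell."""
    return (math.pi / 6.0) * (k_hi ** 3 - k_lo ** 3)


k_lo, k_hi = 30.0, 32.0
counted = count_states_in_shell(k_lo, k_hi)
estimated = shell_estimate(k_lo, k_hi)
print("lattice count:     ", counted)
print("continuum estimate:", round(estimated, 1))
print("ratio:             ", round(counted / estimated, 3))
```

The lattice count falls slightly below the continuum value because points on the coordinate planes are excluded; the relative gap shrinks as the shell radius grows.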
In the full quantum theory one has an analogous conservation law. In this case, the phase space region is replaced by a subspace of the space of states expressed in terms of a projection operator $P$, and instead of the probability in phase space, one has the probability density
$$\Sigma := \frac{P}{\operatorname{Tr}(P)}, \quad N = \operatorname{Tr}(P) = \mathrm{const.},$$
where $N$ is the dimensionality of the subspace. The conservation law in this case is expressed by the unitarity of the S-matrix. In either case, the considerations assume a closed isolated system. This closed isolated system is a system with (1) a fixed energy and (2) a fixed number of particles in (3) a state of equilibrium. If one considers a huge number of replicas of this system, one obtains what is called a microcanonical ensemble. It is for this system that one postulates in quantum statistics the "fundamental postulate of equal a priori probabilities of an isolated system." This says that the isolated system in equilibrium occupies each of its accessible states with the same probability. This fundamental postulate therefore allows us to equate the a priori probability to the degeneracy of a system, i.e., to the number of different states with the same energy.
Example
The following example illustrates the a priori probability (or a priori weighting) in (a) classical and (b) quantal contexts.
- Classical a priori probability
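Consider the rotational energy $E$ of a diatomic molecule with moment of inertia $I$ in spherical polar coordinates $\theta, \phi$ (this means $q$ above is here $\theta, \phi$), i.e.
$$E = \frac{1}{2I}\left(p_\theta^2 + \frac{p_\phi^2}{\sin^2\theta}\right).$$
The $(p_\theta, p_\phi)$-curve for constant $E$ and $\theta$ is an ellipse of area
$$\int dp_\theta \, dp_\phi = \pi \sqrt{2IE} \, \sqrt{2IE} \sin\theta = 2\pi I E \sin\theta.$$
By integrating over $\theta$ and $\phi$ the total volume of phase space covered for constant energy $E$ is
$$\int_0^{2\pi} \int_0^{\pi} 2\pi I E \sin\theta \, d\theta \, d\phi = 8\pi^2 I E,$$
and hence the classical a priori weighting in the energy range $dE$ is
- $\Omega \propto$ (phase space volume at $E + dE$) minus (phase space volume at $E$), which is given by $8\pi^2 I \, dE$.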
- Quantum a priori probability
Assuming that the number of quantum states in a range $\Delta q \Delta p$ for each direction of motion is given, per element, by a factor $\Delta q \Delta p / h$, the number of states in the energy range $dE$ is, as seen under (a), $8\pi^2 I \, dE / h^2$ for the rotating diatomic molecule. From wave mechanics it is known that the energy levels of a rotating diatomic molecule are given by
$$E_n = \frac{n(n+1) h^2}{8\pi^2 I},$$
each such level being $(2n+1)$-fold degenerate. By evaluating $dn/dE_n = 1/(dE_n/dn)$ one obtains
$$\frac{dn}{dE_n} = \frac{8\pi^2 I}{(2n+1) h^2}, \quad (2n+1) \, dn = \frac{8\pi^2 I}{h^2} \, dE_n.$$
Thus by comparison with $\Omega$ above, one finds that the approximate number of states in the range $dE$ is given by the degeneracy, i.e.
$$\Sigma \propto (2n+1) \, dn.$$
Thus the a priori weighting in the classical context (a) corresponds to the a priori weighting here in the quantal context (b). In the case of the one-dimensional simple harmonic oscillator of natural frequency $\nu$ one finds correspondingly: (a) $\Omega \propto dE/\nu$, and (b) $\Sigma \propto dn$ (no degeneracy). Thus in quantum mechanics the a priori probability is effectively a measure of the degeneracy, i.e. the number of states having the same energy.
In the case of the hydrogen atom or Coulomb potential (where the evaluation of the phase space volume for constant energy is more complicated) one knows that the quantum mechanical degeneracy is $n^2$ with $E \propto 1/n^2$. Thus in this case $\Sigma \propto n^2 \, dn$.
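The correspondence between the classical and quantal weightings for the rotating diatomic molecule can be checked by counting. The sketch below assumes the rigid-rotor levels $E_n = n(n+1)h^2/(8\pi^2 I)$ with degeneracy $2n+1$ and works in units where $h^2/(8\pi^2 I) = 1$, so that the classical number of states below energy $E$, namely $8\pi^2 I E / h^2$, is simply $E$. The function names are illustrative.

```python
def quantum_states_up_to(n_max):
    """Number of rotor states with quantum number <= n_max,
    counting each level with its (2n + 1)-fold degeneracy."""
    return sum(2 * n + 1 for n in range(n_max + 1))


def classical_states_up_to(n_max):
    """Classical estimate 8*pi^2*I*E/h^2 evaluated at E = E_{n_max},
    in units where h^2 / (8*pi^2*I) = 1, i.e. E_n = n(n + 1)."""
    return n_max * (n_max + 1)


for n_max in (10, 100, 1000):
    q = quantum_states_up_to(n_max)
    c = classical_states_up_to(n_max)
    print(n_max, q, c, round(q / c, 4))
```

The ratio approaches 1 as the quantum number grows, which is the sense in which the phase space volume divided by $h^2$ counts the degenerate quantum states.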
A priori probability and distribution functions
In statistical mechanics (see any book) one derives the so-called distribution functions $f$ for various statistics. In the case of Fermi–Dirac statistics and Bose–Einstein statistics these functions are respectively
$$f_i^{FD} = \frac{1}{e^{(\epsilon_i - \epsilon_0)/kT} + 1}, \quad f_i^{BE} = \frac{1}{e^{(\epsilon_i - \epsilon_0)/kT} - 1}.$$
These functions are derived for (1) a system in dynamic equilibrium (i.e., under steady, uniform conditions) with (2) total (and huge) number of particles $N = \sum_i n_i$ (this condition determines the constant $\epsilon_0$), and (3) total energy $E = \sum_i n_i \epsilon_i$, i.e., with each of the $n_i$ particles having the energy $\epsilon_i$. An important aspect in the derivation is the taking into account of the indistinguishability of particles and states in quantum statistics, i.e., there particles and states do not have labels. In the case of fermions, like electrons, obeying the Pauli principle (only one particle per state or none allowed), one has therefore
$$0 \leq f_i^{FD} \leq 1, \quad \text{whereas} \quad 0 \leq f_i^{BE} \leq \infty.$$
Thus $f_i^{FD}$ is a measure of the fraction of states actually occupied by electrons at energy $\epsilon_i$ and temperature $T$. On the other hand, the a priori probability $g_i$ is a measure of the number of wave mechanical states available. Hence
$$n_i = f_i g_i.$$
Since $n_i$ is constant under uniform conditions (as many particles as flow out of a volume element also flow in steadily, so that the situation in the element appears static), i.e., independent of time $t$, and $g_i$ is also independent of time $t$ as shown earlier, we obtain
$$\frac{df_i}{dt} = 0, \quad f_i = f_i(t, \mathbf{v}_i, \mathbf{r}_i).$$
Expressing this equation in terms of its partial derivatives, one obtains the Boltzmann transport equation. How do coordinates $\mathbf{r}$ etc. appear here suddenly? Above no mention was made of electric or other fields. Thus with no such fields present we have the Fermi–Dirac distribution as above. But with such fields present we have this additional dependence of $f$.
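The two distribution functions and the relation between occupation number, distribution function and a priori weighting can be written down directly. This is a minimal sketch assuming the standard forms $1/(e^{(\epsilon - \epsilon_0)/kT} \pm 1)$; the energies, the temperature scale `kT`, and the degeneracy value are arbitrary illustrative numbers in arbitrary units.

```python
import math


def fermi_dirac(energy, eps0, kT):
    """Fermi-Dirac distribution: fraction of available states occupied."""
    return 1.0 / (math.exp((energy - eps0) / kT) + 1.0)


def bose_einstein(energy, eps0, kT):
    """Bose-Einstein distribution; only defined for energy above eps0."""
    if energy <= eps0:
        raise ValueError("Bose-Einstein distribution requires energy > eps0")
    return 1.0 / (math.exp((energy - eps0) / kT) - 1.0)


def occupation_number(f, g):
    """Number of particles n = f * g, with g the a priori weighting
    (number of available states) and f the distribution function."""
    return f * g


eps0, kT = 1.0, 0.1
print("FD at eps0:        ", fermi_dirac(eps0, eps0, kT))
print("FD above eps0:     ", round(fermi_dirac(1.3, eps0, kT), 4))
print("BE just above eps0:", round(bose_einstein(1.01, eps0, kT), 4))
print("n for g = 6 states:", round(occupation_number(fermi_dirac(1.3, eps0, kT), 6), 4))
```

The Fermi–Dirac value never exceeds 1, reflecting at most one particle per state, whereas the Bose–Einstein value grows without bound as the energy approaches $\epsilon_0$ from above.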
Notes
1. Robert, Christian (1994). "From Prior Information to Prior Distributions". The Bayesian Choice. New York: Springer. pp. 89–136. ISBN 0-387-94296-3.
2. Chaloner, Kathryn (1996). "Elicitation of Prior Distributions". In Berry, Donald A.; Stangl, Dalene (eds.). Bayesian Biostatistics. New York: Marcel Dekker. pp. 141–156. ISBN 0-8247-9334-X.
3. Mikkola, Petrus; et al. (2024). "Prior Knowledge Elicitation: The Past, Present, and Future". Bayesian Analysis. 19 (4). doi:10.1214/23-BA1381. hdl:11336/183197. S2CID 244798734.
4. Icazatti, Alejandro; Abril-Pla, Oriol; Klami, Arto; Martin, Osvaldo A. (September 2023). "PreliZ: A tool-box for prior elicitation". Journal of Open Source Software. 8 (89): 5499. Bibcode:2023JOSS....8.5499I. doi:10.21105/joss.05499.
5. Zellner, Arnold (1971). "Prior Distributions to Represent 'Knowing Little'". An Introduction to Bayesian Inference in Econometrics. New York: John Wiley & Sons. pp. 41–53. ISBN 0-471-98165-6.
6. Price, Harold J.; Manson, Allison R. (2001). "Uninformative priors for Bayes' theorem". AIP Conf. Proc. 617: 379–391. doi:10.1063/1.1477060.
7. Piironen, Juho; Vehtari, Aki (2017). "Sparsity information and regularization in the horseshoe and other shrinkage priors". Electronic Journal of Statistics. 11 (2): 5018–5051. arXiv:1707.01694. doi:10.1214/17-EJS1337SI.
8. Simpson, Daniel; et al. (2017). "Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors". Statistical Science. 32 (1): 1–28. arXiv:1403.4630. doi:10.1214/16-STS576. S2CID 88513041.
9. Fortuin, Vincent (2022). "Priors in Bayesian Deep Learning: A Review". International Statistical Review. 90 (3): 563–591. doi:10.1111/insr.12502. hdl:20.500.11850/547969. S2CID 234681651.
10. Congdon, Peter D. (2020). "Regression Techniques using Hierarchical Priors". Bayesian Hierarchical Models (2nd ed.). Boca Raton: CRC Press. pp. 253–315. ISBN 978-1-03-217715-1.
11. Florens, Jean-Pierre; Mouchart, Michael; Rolin, Jean-Marie (1990). "Invariance Arguments in Bayesian Statistics". Economic Decision-Making: Games, Econometrics and Optimisation. North-Holland. pp. 351–367. ISBN 0-444-88422-X.
12. Jaynes, Edwin T. (Sep 1968). "Prior Probabilities" (PDF). IEEE Transactions on Systems Science and Cybernetics. 4 (3): 227–241. doi:10.1109/TSSC.1968.300117.
13. This prior was proposed by J.B.S. Haldane in "A note on inverse probability", Mathematical Proceedings of the Cambridge Philosophical Society 28, 55–61, 1932, doi:10.1017/S0305004100010495. See also J. Haldane, "The precision of observed values of small frequencies", Biometrika, 35:297–300, 1948, doi:10.2307/2332350, JSTOR 2332350.
14. Datta, Gauri Sankar; Mukerjee, Rahul (2004). Probability Matching Priors: Higher Order Asymptotics. Springer. ISBN 978-0-387-20329-4.
15. Esfahani, M. S.; Dougherty, E. R. (2014). "Incorporation of Biological Pathway Knowledge in the Construction of Priors for Optimal Bayesian Classification". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 11 (1): 202–18. doi:10.1109/TCBB.2013.143. PMID 26355519. S2CID 10096507.
16. Boluki, Shahin; Esfahani, Mohammad Shahrokh; Qian, Xiaoning; Dougherty, Edward R (December 2017). "Incorporating biological prior knowledge for Bayesian learning via maximal knowledge-driven information priors". BMC Bioinformatics. 18 (S14): 552. doi:10.1186/s12859-017-1893-4. ISSN 1471-2105. PMC 5751802. PMID 29297278.
17. Jaynes (1968), p. 17; see also Jaynes (2003), chapter 12. Note that chapter 12 is not available in the online preprint but can be previewed via Google Books.
18. Dawid, A. P.; Stone, M.; Zidek, J. V. (1973). "Marginalization Paradoxes in Bayesian and Structural Inference". Journal of the Royal Statistical Society. Series B (Methodological). 35 (2): 189–233. doi:10.1111/j.2517-6161.1973.tb00952.x. JSTOR 2984907.
19. Christensen, Ronald; Johnson, Wesley; Branscum, Adam; Hanson, Timothy E. (2010). Bayesian Ideas and Data Analysis: An Introduction for Scientists and Statisticians. Hoboken: CRC Press. p. 69. ISBN 9781439894798.
20. Iba, Y. (1989). "Bayesian Statistics and Statistical Mechanics". In Takayama, H. (ed.). Cooperative Dynamics in Complex Physical Systems. Springer Series in Synergetics. Vol. 43. Berlin: Springer. pp. 235–236. doi:10.1007/978-3-642-74554-6_60. ISBN 978-3-642-74556-0.
21. Müller-Kirsten, H. J. W. (2013). Basics of Statistical Physics (2nd ed.). Singapore: World Scientific. Chapter 6.
22. Ben-Naim, A. (2007). Entropy Demystified. Singapore: World Scientific.
References
- Bauwens, Luc; Lubrano, Michel; Richard, Jean-François (1999). "Prior Densities for the Regression Model". Bayesian Inference in Dynamic Econometric Models. Oxford University Press. pp. 94–128. ISBN 0-19-877313-7.
- Gelman, Andrew; Carlin, John B.; Stern, Hal; Rubin, Donald B. (2003). Bayesian Data Analysis (2nd ed.). Boca Raton: Chapman & Hall/CRC. ISBN 978-1-58488-388-3. MR 2027492.
- Berger, James O. (1985). Statistical decision theory and Bayesian analysis. Berlin: Springer-Verlag. ISBN 978-0-387-96098-2. MR 0804611.
- Berger, James O.; Strawderman, William E. (1996). "Choice of hierarchical priors: admissibility in estimation of normal means". Annals of Statistics. 24 (3): 931–951. doi:10.1214/aos/1032526950. MR 1401831. Zbl 0865.62004.
- Bernardo, Jose M. (1979). "Reference Posterior Distributions for Bayesian Inference". Journal of the Royal Statistical Society, Series B. 41 (2): 113–147. doi:10.1111/j.2517-6161.1979.tb01066.x. JSTOR 2985028. MR 0547240.
- Berger, James O.; Bernardo, José M.; Sun, Dongchu (2009). "The formal definition of reference priors". Annals of Statistics. 37 (2): 905–938. arXiv:0904.0156. Bibcode:2009arXiv0904.0156B. doi:10.1214/07-AOS587. S2CID 3221355.
- Jaynes, Edwin T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. ISBN 978-0-521-59271-0.
- Williamson, Jon (2010). "Review of Bruno de Finetti, Philosophical Lectures on Probability" (PDF). Philosophia Mathematica. 18 (1): 130–135. doi:10.1093/philmat/nkp019. Archived from the original (PDF) on 2011-06-09. Retrieved 2010-07-02.
External links
- PriorDB a collaborative database of models and their priors
Prior probability
Fundamentals
Definition and Interpretation
In Bayesian statistics, the prior probability is the probability distribution assigned to an unknown parameter θ or hypothesis before any data are observed. Denoted p(θ), it encapsulates the initial state of knowledge or belief about θ.[6] This distribution is the starting point for inference and represents the uncertainty or information available before data collection.[7]
The interpretation of prior probabilities can be subjective or objective. In the subjective view, priors reflect the personal beliefs or expert knowledge of the analyst, which allows relevant prior information to be incorporated into the analysis.[8] The objective approach instead seeks priors that are minimally informative or derived from formal principles, so that results are reproducible and free of personal bias.[9] Historically, Pierre-Simon Laplace introduced an early objective perspective through his 1812 principle of insufficient reason (also known as the principle of indifference), which holds that, in the absence of information favoring one outcome over another, equal probabilities should be assigned to each possibility.[10]
For continuous parameters, the prior is typically expressed as a probability density function p(θ) that integrates to 1 over the parameter space; for discrete parameters, it is a probability mass function that assigns a probability to each possible value.[11] A simple example: when estimating the bias θ (the probability of heads) of a coin before any flips are observed, a Beta(1,1) prior, which is uniform over [0,1], assigns equal density to all values and is a common way to represent ignorance about θ.[12]
Role in Bayesian Inference
In Bayesian inference, the prior is the initial distribution over parameter values or hypotheses before the data are observed; Bayes' theorem then updates it to form the posterior distribution.[13] Bayes' theorem states that the posterior probability density function is proportional to the product of the likelihood and the prior:
$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta),$$
where $\theta$ represents the parameters, $y$ the observed data, $p(\theta)$ the prior, and $p(y \mid \theta)$ the likelihood. This multiplicative structure means that the prior directly weights the likelihood in producing updated beliefs about $\theta$.[14] The full posterior is obtained by normalizing this product:
$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}, \qquad p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta,$$
where the evidence $p(y)$ is the marginal likelihood. It ensures that the posterior integrates to 1, quantifies the total probability of the data under the model, and is the basis for model comparison.[15]
The posterior distribution encapsulates the updated beliefs, combining the prior's information with the evidential content of the data through the likelihood. For predictive purposes, one can marginalize over $\theta$ to obtain the posterior predictive distribution $p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta$, where $\tilde{y}$ denotes future observations.[14] This integration shows that the prior shapes not only parameter inference but also forecasts, because initial uncertainty propagates through the model.[15]
Conceptually, Bayesian updating follows a sequential flow: begin with the prior, which encodes pre-data knowledge; incorporate the likelihood, which reflects how compatible the data are with each parameter value; and arrive at the posterior as the synthesis, with the evidence serving as the normalizing constant.[16]
A simple discrete example illustrates this in disease testing: suppose the prior probability of having a rare disease is 0.01 (1% prevalence), and a test with 99% sensitivity (true positive rate) and 99% specificity (true negative rate) yields a positive result. The likelihood of a positive test given the disease is 0.99, and given no disease it is 0.01 (the false positive rate).
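The arithmetic for this example can be checked with a short script. The following is a minimal Python sketch, not a reference implementation; the function name `posterior_given_positive` and its argument names are illustrative assumptions rather than anything defined in the text.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """Return P(disease | positive test) using Bayes' theorem.

    prior       -- P(disease) before the test
    sensitivity -- P(positive | disease)
    specificity -- P(negative | no disease)
    """
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1.0 - specificity  # false positive rate

    # Evidence: total probability of a positive result.
    evidence = (p_pos_given_disease * prior
                + p_pos_given_healthy * (1.0 - prior))

    # Posterior = likelihood * prior / evidence.
    return p_pos_given_disease * prior / evidence


# Numbers taken from the disease-testing example in the text.
posterior = posterior_given_positive(prior=0.01,
                                     sensitivity=0.99,
                                     specificity=0.99)
print(f"P(disease | positive) = {posterior:.2f}")
```

Raising the `prior` argument (for example, to reflect testing within a high-risk group) is a quick way to see how strongly the result depends on prevalence.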
Applying Bayes' theorem, the posterior probability of having the disease is 0.50, calculated as $\frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99} = 0.5$. Even with a highly accurate test, the low prior tempers the evidential strength of a positive result and guards against overconfidence.[17]
Informative Priors
Strong Priors
Strong priors, also known as highly informative priors, are probability distributions concentrated tightly around specific values, typically through a low variance (equivalently, a high precision), so that they dominate the posterior distribution, particularly when data are sparse. For instance, a normal prior $\mathcal{N}(\mu_0, \sigma_0^2)$ with a small variance $\sigma_0^2$ places substantial weight near $\mu_0$ and pulls the posterior mean toward this value even with limited observations. This concentration reflects strong expert beliefs or accumulated evidence from prior studies, so the prior acts as a firm anchor in Bayesian updating.[18]
The primary advantage of strong priors lies in small-sample studies or settings with reliable domain knowledge, where they bring in substantive information that improves estimation precision and reduces overfitting. In clinical trials, for example, historical data from previous experiments can inform a strong prior on treatment effects, which allows information to be borrowed efficiently and increases power without requiring large new samples. In pharmaceutical research, a strong prior derived from earlier studies of drug efficacy, such as a normal prior centered on the response rate expected from Phase II trials, can shift the posterior toward the prior mean when new Phase III data are limited, leading to more stable inferences about efficacy.[19]
The disadvantage of strong priors is that they introduce bias if misspecified: their dominant influence can pull the posterior away from the true parameter value and lead to misleading conclusions. Sensitivity analyses, which assess how posterior inferences change when the prior is perturbed, are therefore essential, as is careful validation against domain expertise. As a milder alternative, weakly informative priors provide regularization with less risk of overriding the data.[18]
Weakly Informative Priors
Weakly informative priors are probability distributions that incorporate a minimal amount of prior knowledge: they are broad, but carry enough structure to ensure computational stability and reasonable posterior inferences without substantially overriding the data. They typically use heavy-tailed distributions such as the Cauchy or a Student's t-distribution with low degrees of freedom and a large scale parameter, centered on plausible values such as zero for regression coefficients, while still allowing extreme outcomes if the evidence supports them. For instance, a Cauchy distribution with location 0 and scale 2.5 (or 10 for intercepts) downweights implausibly large parameter values, yet remains diffuse enough to let the likelihood dominate in most scenarios.[20][21]
The primary purpose of weakly informative priors is to mitigate pathological behavior in Bayesian inference, such as the improper posteriors or infinite variance that can arise from fully non-informative alternatives, while staying close to objectivity by exerting only light regularization. They stabilize estimates in challenging settings such as small sample sizes, high-dimensional models, or parameter non-identifiability (e.g., complete separation in logistic regression), where the data alone might yield unstable or extreme results. By introducing just enough structure, such as a finite scale and decaying tails, these priors promote robust modeling without assuming strong domain-specific beliefs, which makes them suitable as default choices for exploratory analyses or when prior elicitation is difficult.[20][21]
Weakly informative priors gained prominence in the 2010s through the advocacy of Andrew Gelman and collaborators, who emphasized their role in robust Bayesian data analysis through tools such as Stan and detailed methodological guidance. Gelman et al. recommended these priors for hierarchical and regression models to balance flexibility and reliability, which influenced their adoption in fields that require reproducible inference. A representative example occurs in linear regression, where a normal prior on the coefficients with mean 0 and a large standard deviation (e.g., 10) provides mild shrinkage toward zero. This regularizes the model against overfitting when predictors are multicollinear, while permitting data-driven deviations for truly important effects, and avoids the erratic predictions that flat priors can produce.[21][20]
Non-Informative Priors
Objective Priors
Objective priors in Bayesian statistics are probability distributions selected to exert minimal influence on the posterior distribution, so that the observed data predominantly determine the inference. These priors aim to represent a state of ignorance or objectivity about the parameter values, and the posterior closely approximates the normalized likelihood when sufficient data are available.[22][23]
A prominent example is the Jeffreys prior, defined as the square root of the determinant of the Fisher information matrix, $p(\theta) \propto \sqrt{\det I(\theta)}$, where $I(\theta)$ quantifies the amount of information the data provide about $\theta$. This construction, originally proposed by Harold Jeffreys, is invariant under smooth reparameterizations of the model: the prior transforms so that the inferential properties are the same however the parameter is expressed.[22][24] For instance, in a normal distribution with known variance, the Jeffreys prior for the mean $\mu$ is uniform, $p(\mu) \propto 1$, while for the standard deviation $\sigma$ with known mean it is $p(\sigma) \propto 1/\sigma$.[22][25]
Another key type is the uniform prior, often used for bounded parameters to express uniformity over the possible range. For a proportion parameter $\theta$ in a binomial model, a uniform prior on $[0, 1]$ corresponds to a Beta(1,1) distribution, which integrates to 1 and yields a posterior that is simply the likelihood normalized over the parameter space.[25][23] If $y$ successes are observed in $n$ trials, the posterior becomes Beta(1 + y, 1 + n - y), which directly reflects the data's evidential content without additional prior weighting.[25]
Reference priors extend this objectivity asymptotically: they maximize the expected Kullback-Leibler divergence between the prior and the posterior, so that the prior adds the least possible information relative to the data. Developed by José M. Bernardo, these priors coincide with Jeffreys priors in regular one-dimensional cases but provide a more robust approach for multiparameter models by prioritizing the parameters of interest.[26][22] This property makes reference priors particularly suitable for consistent inference as sample sizes grow, preserving the data's dominance in the limit.[26]
Improper Priors
Improper priors are prior functions that do not integrate to a finite value over their domain, meaning ∫ p(θ) dθ = ∞, so they cannot be normalized into formal probability densities.[27] Classic examples include the uniform prior over the entire real line, (-∞, ∞), and the prior proportional to 1/θ for positive scale parameters θ > 0. The former is flat in θ and the latter is flat in log θ; neither can be normalized to integrate to 1.[27] Despite their mathematical impropriety, these priors can serve as limiting cases of proper distributions with increasingly diffuse support, which makes them useful in non-informative Bayesian analysis.
For inference to be valid, an improper prior must yield a proper posterior distribution, which requires that the integral ∫ p(data|θ) p(θ) dθ be finite, so that the posterior can be normalized to integrate to 1.[28] This condition holds when the likelihood p(data|θ) decays quickly enough over the parameter space to dominate the prior's divergence. A prominent example is the Haldane prior, Beta(0,0), which is proportional to p^{-1}(1-p)^{-1} for a binomial success probability p ∈ (0,1) and is improper because of its singularities at the boundaries.[29] When combined with binomial data containing at least one success and one failure, the resulting posterior is Beta(n, m), where n and m are the counts of successes and failures. This posterior is proper, and its mean n/(n + m) equals the maximum likelihood estimate, which highlights the prior's utility in objective settings.[29]
These priors offer computational simplicity, as they often lead to analytically tractable posteriors without imposing subjective beliefs, which makes them appealing for default analyses.[30] However, risks arise if the posterior remains improper, which can occur with insufficient data or ill-posed models, and improper priors leave marginal likelihoods undefined up to an arbitrary constant, which invalidates model comparisons.[30] Careful verification of posterior propriety is essential to avoid misleading inferences.[28]
Linear regression illustrates these dynamics: a flat improper prior on the coefficients β, p(β) ∝ 1, is paired with an improper prior on the variance, p(σ²) ∝ 1/σ². Without data, the posterior stays improper, reflecting the model's underidentification.[31] With sufficient observations (more observations than coefficients and a full-rank design matrix), the likelihood regularizes the posterior into a proper multivariate normal for β (conditional on σ²) and an inverse-gamma for σ², which gives standard Bayesian estimates akin to least squares but with uncertainty quantification.[31] This setup shows how data can salvage inference from improper priors in well-posed problems.[30]
Prior Selection
Elicitation Techniques
Elicitation techniques for prior probabilities are systematic processes for incorporating expert knowledge or existing data into prior distributions in Bayesian analysis. One primary method is expert elicitation through structured questionnaires, which quantify the subjective beliefs of domain specialists. The Delphi method, for instance, runs iterative rounds of anonymous surveys in which experts provide probability assessments and then receive feedback on the group's responses, which moves the group toward consensus and reduces individual biases.[32] This approach is particularly useful when direct data are scarce, since experts can express their uncertainty through quantiles or intervals that are then aggregated into a prior distribution.[33]
Another technique aggregates historical data to inform priors, drawing on past observations or similar studies to construct distributions that reflect accumulated evidence. This often employs hierarchical models to borrow strength across related datasets, so that the prior captures shared patterns without overfitting to any single source.[34] Empirical Bayes methods refine this further by using earlier datasets to estimate the hyperparameters of the prior, typically by maximizing the marginal likelihood, which yields a data-driven yet regularized distribution.[33] These data-informed approaches balance objectivity with the need for prior specification in new analyses.
Formal approaches to elicitation often encode elicited beliefs as distributional moments, such as a mean and variance, and then fit a parametric family such as the normal or Beta distribution to match those characteristics.
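As a minimal sketch of the moment-matching idea just described, the Python snippet below converts an elicited mean and variance for a success probability into Beta hyperparameters. The function name `beta_from_moments` and the example numbers (mean 0.3, variance 0.01) are illustrative assumptions, not values from any elicitation protocol cited here.

```python
def beta_from_moments(mean, variance):
    """Moment-match a Beta(alpha, beta) prior to an elicited mean and variance.

    Uses the Beta identities
        mean     = alpha / (alpha + beta)
        variance = mean * (1 - mean) / (alpha + beta + 1).
    """
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    max_variance = mean * (1.0 - mean)
    if not 0.0 < variance < max_variance:
        raise ValueError("variance must be positive and below mean * (1 - mean)")

    # Solve the variance identity for the total concentration alpha + beta.
    total = max_variance / variance - 1.0
    return mean * total, (1.0 - mean) * total


# Hypothetical elicited judgment: "about 30%, give or take 10 points".
alpha, beta = beta_from_moments(mean=0.3, variance=0.10 ** 2)
print(f"Beta({alpha:.2f}, {beta:.2f})")
```

The variance check matters in practice: an expert who states a spread wider than mean × (1 − mean) has described beliefs that no Beta distribution can represent, which is a useful signal to revisit the elicitation.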
For continuous parameters, experts might provide judgments on expected values and spreads, which are then used to parameterize the prior through maximum likelihood or moment matching.[35] For parameters of discrete outcomes, such as success probabilities, Beta distributions are commonly fitted to quantiles or odds ratios elicited from experts.[34]
Challenges in these techniques include avoiding cognitive biases, such as overconfidence or anchoring, which can distort elicited probabilities and lead to overly narrow priors.[33] Handling uncertainty is also critical: experts may struggle to quantify second-order uncertainties, so robust aggregation methods are needed to propagate variability into the final prior. Protocols like the SHELF framework address this by incorporating feedback loops and sensitivity checks during elicitation.[36]
A representative example involves eliciting a prior for earthquake magnitude from seismologists, where experts provide quantile judgments on expected magnitudes in a region. These assessments are then fitted to a log-normal distribution to capture the skewed nature of seismic events, yielding a prior that informs probabilistic hazard models.[37]
Conjugate Priors
In Bayesian statistics, a conjugate prior for a parameter is a prior distribution such that, when combined with a likelihood via Bayes' theorem, the resulting posterior belongs to the same distributional family as the prior. Conjugacy ensures analytical tractability, because the posterior is obtained simply by updating the prior's hyperparameters with the observed data. The formalization of conjugate priors traces back to Raiffa and Schlaifer (1961), who emphasized their role in decision-theoretic contexts where sufficient statistics of fixed dimension enable closed-form solutions.[38]
Conjugate priors are most naturally defined for likelihoods from exponential families, where the prior is constructed to mimic the likelihood's kernel. Key examples include the beta distribution as conjugate to the binomial or Bernoulli likelihood for modeling success probabilities, the gamma distribution conjugate to the Poisson likelihood for rates, and the normal distribution conjugate to the normal likelihood for mean estimation with known variance. For multivariate settings, the inverse-Wishart distribution serves as the conjugate prior for the covariance matrix of a multivariate normal likelihood. These pairs are summarized in the following table of common conjugate relationships:
| Likelihood Model | Parameter(s) | Conjugate Prior Family | Hyperparameters |
|---|---|---|---|
| Bernoulli/Binomial | Success probability $p$ | Beta | $\alpha$, $\beta$ |
| Poisson | Rate $\lambda$ | Gamma | Shape $\alpha$, rate $\beta$ |
| Normal (known variance) | Mean $\mu$ | Normal | Mean $\mu_0$, precision $\tau_0$ |
| Normal (known mean) | Variance $\sigma^2$ | Inverse-gamma | Shape $\alpha$, scale $\beta$ |
| Multivariate normal | Covariance matrix $\Sigma$ | Inverse-Wishart | Degrees of freedom $\nu$, scale matrix $\Psi$ |
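To make the hyperparameter updating concrete, here is a minimal Python sketch of the first two rows of the table. The function names `beta_binomial_update` and `gamma_poisson_update`, the shape/rate parameterization of the gamma distribution, and the example data are all assumptions chosen for illustration.

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Beta(alpha, beta) prior + binomial data -> Beta posterior hyperparameters."""
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    # Successes add to alpha, failures add to beta.
    return alpha + successes, beta + (trials - successes)


def gamma_poisson_update(shape, rate, counts):
    """Gamma(shape, rate) prior + Poisson counts -> Gamma posterior hyperparameters."""
    if shape <= 0 or rate <= 0:
        raise ValueError("shape and rate must be positive")
    # Total count adds to the shape; number of observations adds to the rate.
    return shape + sum(counts), rate + len(counts)


# Uniform Beta(1, 1) prior, then 7 successes in 10 trials (made-up data).
post_alpha, post_beta = beta_binomial_update(1.0, 1.0, successes=7, trials=10)
posterior_mean = post_alpha / (post_alpha + post_beta)
print(f"Beta({post_alpha:g}, {post_beta:g}), mean = {posterior_mean:.3f}")

# Gamma(2, 1) prior on a Poisson rate, then three observed counts (made-up data).
post_shape, post_rate = gamma_poisson_update(2.0, 1.0, counts=[3, 5, 4])
print(f"Gamma(shape={post_shape:g}, rate={post_rate:g})")
```

In both cases the update touches only the hyperparameters, with no integration required, which is the practical appeal of conjugacy.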
