Bayesian inference
Bayesian inference (/ˈbeɪziən/ BAY-zee-ən or /ˈbeɪʒən/ BAY-zhən)[1] is a method of statistical inference in which Bayes' theorem is used to calculate a probability of a hypothesis, given prior evidence, and update it as more information becomes available. Fundamentally, Bayesian inference uses a prior distribution to estimate posterior probabilities. Bayesian inference is an important technique in statistics, and especially in mathematical statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law. In the philosophy of decision theory, Bayesian inference is closely related to subjective probability, often called "Bayesian probability".
Introduction to Bayes' rule
Formal explanation
| Evidence \ Hypothesis | Satisfies hypothesis H | Violates hypothesis ¬H | Total |
|---|---|---|---|
| Has evidence E | P(H∩E) = P(E∣H)·P(H) | P(¬H∩E) = P(E∣¬H)·P(¬H) | P(E) |
| No evidence ¬E | P(H∩¬E) = P(¬E∣H)·P(H) | P(¬H∩¬E) = P(¬E∣¬H)·P(¬H) | P(¬E) = 1 − P(E) |
| Total | P(H) | P(¬H) = 1 − P(H) | 1 |
Bayesian inference derives the posterior probability as a consequence of two antecedents: a prior probability and a "likelihood function" derived from a statistical model for the observed data. Bayesian inference computes the posterior probability according to Bayes' theorem:

P(H | E) = P(E | H) · P(H) / P(E)
where
- H stands for any hypothesis whose probability may be affected by data (called evidence below). Often there are competing hypotheses, and the task is to determine which is the most probable.
- P(H), the prior probability, is the estimate of the probability of the hypothesis H before the data E, the current evidence, is observed.
- E, the evidence, corresponds to new data that were not used in computing the prior probability.
- P(H | E), the posterior probability, is the probability of H given E, i.e., after E is observed. This is what we want to know: the probability of a hypothesis given the observed evidence.
- P(E | H) is the probability of observing E given H and is called the likelihood. As a function of E with H fixed, it indicates the compatibility of the evidence with the given hypothesis. The likelihood function is a function of the evidence, E, while the posterior probability is a function of the hypothesis, H.
- P(E) is sometimes termed the marginal likelihood or "model evidence". This factor is the same for all possible hypotheses being considered (as is evident from the fact that the hypothesis H does not appear anywhere in the symbol, unlike for all the other factors) and hence does not factor into determining the relative probabilities of different hypotheses.
- P(E) must be strictly positive for the posterior to be defined. (Else one has division by zero.)
For different values of H, only the factors P(E | H) and P(H), both in the numerator, affect the value of P(H | E) – the posterior probability of a hypothesis is proportional to its prior probability (its inherent likeliness) and the newly acquired likelihood (its compatibility with the new observed evidence).
In cases where P(E | ¬H), the likelihood of the evidence under ¬H ("not H"), the logical negation of H, is a valid likelihood, Bayes' rule can be rewritten as follows:

P(H | E) = P(E | H) · P(H) / P(E) = P(E | H) · P(H) / [P(E | H) · P(H) + P(E | ¬H) · P(¬H)] = 1 / (1 + [P(¬H) / P(H)] · [P(E | ¬H) / P(E | H)])

because

P(E) = P(E | H) · P(H) + P(E | ¬H) · P(¬H)

and

P(H) + P(¬H) = 1.

This focuses attention on the term

[P(¬H) / P(H)] · [P(E | ¬H) / P(E | H)].

If that term is approximately 1, then P(H | E) is about 1/2: the hypothesis is about as likely as not, given the evidence. If that term is very small, close to zero, then P(H | E) is close to 1 and the hypothesis is quite likely given the evidence. If that term is very large, much larger than 1, then P(H | E) is close to 0 and the hypothesis is quite unlikely given the evidence. If the hypothesis (without consideration of the evidence) is unlikely, then P(H) is small (but not necessarily astronomically small), P(¬H) is close to 1, and the term can be approximated as P(E | ¬H) / [P(H) · P(E | H)], so the relevant probabilities can be compared directly to each other.
One quick and easy way to remember the equation is the rule of multiplication, which expresses the joint probability both ways:

P(E ∩ H) = P(E | H) · P(H) = P(H | E) · P(E)
Alternatives to Bayesian updating
Bayesian updating is widely used and computationally convenient. However, it is not the only updating rule that might be considered rational.
Ian Hacking noted that traditional "Dutch book" arguments did not specify Bayesian updating: they left open the possibility that non-Bayesian updating rules could avoid Dutch books. Hacking wrote:[2] "And neither the Dutch book argument nor any other in the personalist arsenal of proofs of the probability axioms entails the dynamic assumption. Not one entails Bayesianism. So the personalist requires the dynamic assumption to be Bayesian. It is true that in consistency a personalist could abandon the Bayesian model of learning from experience. Salt could lose its savour."
Indeed, there are non-Bayesian updating rules that also avoid Dutch books (as discussed in the literature on "probability kinematics") following the publication of Richard C. Jeffrey's rule, which applies Bayes' rule to the case where the evidence itself is assigned a probability.[3] The additional hypotheses needed to uniquely require Bayesian updating have been deemed to be substantial, complicated, and unsatisfactory.[4]
Inference over exclusive and exhaustive possibilities
If evidence is simultaneously used to update belief over a set of exclusive and exhaustive propositions, Bayesian inference may be thought of as acting on this belief distribution as a whole.
General formulation
Suppose a process is generating independent and identically distributed events E_n (n = 1, 2, 3, …), but the probability distribution is unknown. Let the event space Ω represent the current state of belief for this process. Each model is represented by an event M_m. The conditional probabilities P(E_n | M_m) are specified to define the models. P(M_m) is the degree of belief in M_m. Before the first inference step, {P(M_m)} is a set of initial prior probabilities. These must sum to 1, but are otherwise arbitrary.
Suppose that the process is observed to generate E ∈ {E_n}. For each M ∈ {M_m}, the prior P(M) is updated to the posterior P(M | E). From Bayes' theorem:[5]

P(M | E) = P(E | M) · P(M) / Σ_m P(E | M_m) · P(M_m)
Upon observation of further evidence, this procedure may be repeated.
Multiple observations
For a sequence of independent and identically distributed observations E = (e_1, …, e_n), it can be shown by induction that repeated application of the above is equivalent to

P(M | E) = P(E | M) · P(M) / Σ_m P(E | M_m) · P(M_m)

where

P(E | M) = Π_k P(e_k | M).
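To make the sequential-versus-batch equivalence concrete, here is a minimal sketch with two invented coin-bias models M1 and M2 (nothing here comes from the article): updating after each observation and updating once on the whole batch give the same posterior.

```python
import numpy as np

# Two hypothetical models for a coin: P(heads | M1) = 0.5, P(heads | M2) = 0.8.
likelihood_heads = np.array([0.5, 0.8])
prior = np.array([0.5, 0.5])     # initial degrees of belief P(M1), P(M2)

observations = [1, 1, 0, 1]      # 1 = heads, 0 = tails

# Sequential updating: the posterior after each event becomes the next prior.
belief = prior.copy()
for e in observations:
    lik = likelihood_heads if e == 1 else 1.0 - likelihood_heads
    belief = lik * belief
    belief /= belief.sum()       # normalize by P(E) = sum_m P(E | M_m) P(M_m)

# Batch updating: multiply the i.i.d. likelihoods once, then normalize.
lik_all = np.prod([likelihood_heads if e == 1 else 1.0 - likelihood_heads
                   for e in observations], axis=0)
batch = lik_all * prior
batch /= batch.sum()

print(belief, batch)             # identical posteriors, ~ [0.379, 0.621]
```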
Parametric formulation: motivating the formal description
By parameterizing the space of models, the belief in all models may be updated in a single step. The distribution of belief over the model space may then be thought of as a distribution of belief over the parameter space. The distributions in this section are expressed as continuous, represented by probability densities, as this is the usual situation. The technique is, however, equally applicable to discrete distributions.
Let the vector θ span the parameter space. Let the initial prior distribution over θ be p(θ | α), where α is a set of parameters to the prior itself, or hyperparameters. Let E = (e_1, …, e_n) be a sequence of independent and identically distributed event observations, where all e_i are distributed as p(e | θ) for some θ. Bayes' theorem is applied to find the posterior distribution over θ:

p(θ | E, α) = p(E | θ, α) · p(θ | α) / p(E | α) ∝ p(E | θ, α) · p(θ | α)

where

p(E | α) = ∫ p(E | θ, α) · p(θ | α) dθ
Formal description of Bayesian inference
Definitions
- x, a data point in general. This may in fact be a vector of values.
- θ, the parameter of the data point's distribution, i.e., x ~ p(x | θ). This may be a vector of parameters.
- α, the hyperparameter of the parameter distribution, i.e., θ ~ p(θ | α). This may be a vector of hyperparameters.
- X is the sample, a set of n observed data points, i.e., x_1, …, x_n.
- x̃, a new data point whose distribution is to be predicted.
Bayesian inference
- The prior distribution is the distribution of the parameter(s) before any data is observed, i.e. p(θ | α). The prior distribution might not be easily determined; in such a case, one possibility may be to use the Jeffreys prior to obtain a prior distribution before updating it with newer observations.
- The sampling distribution is the distribution of the observed data conditional on its parameters, i.e. p(X | θ). This is also termed the likelihood, especially when viewed as a function of the parameter(s), sometimes written L(θ | X) = p(X | θ).
- The marginal likelihood (sometimes also termed the evidence) is the distribution of the observed data marginalized over the parameter(s), i.e. p(X | α) = ∫ p(X | θ) · p(θ | α) dθ. It quantifies the agreement between data and expert opinion, in a geometric sense that can be made precise.[6] If the marginal likelihood is 0 then there is no agreement between the data and expert opinion and Bayes' rule cannot be applied.
- The posterior distribution is the distribution of the parameter(s) after taking into account the observed data. This is determined by Bayes' rule, which forms the heart of Bayesian inference: p(θ | X, α) = p(X | θ) · p(θ | α) / p(X | α) ∝ p(X | θ) · p(θ | α). This is expressed in words as "posterior is proportional to likelihood times prior", or sometimes as "posterior = likelihood times prior, over evidence".
- In practice, for almost all complex Bayesian models used in machine learning, the posterior distribution is not obtained in closed form, mainly because the parameter space for θ can be very high-dimensional, or because the Bayesian model retains a hierarchical structure formulated from the observations X and parameters θ. In such situations, we need to resort to approximation techniques.[7]
- General case: Let P(x | y) be the conditional distribution of x given y and let P(y) be the distribution of y. The joint distribution is then P(x, y) = P(x | y) · P(y). The conditional distribution P(y | x) of y given x is then determined by P(y | x) = P(x | y) · P(y) / P(x).
Existence and uniqueness of the needed conditional expectation is a consequence of the Radon–Nikodym theorem. This was formulated by Kolmogorov in his famous book from 1933. Kolmogorov underlines the importance of conditional probability by writing "I wish to call attention to ... and especially the theory of conditional probabilities and conditional expectations ..." in the Preface.[8] Bayes' theorem determines the posterior distribution from the prior distribution. Uniqueness requires continuity assumptions.[9] Bayes' theorem can be generalized to include improper prior distributions such as the uniform distribution on the real line.[10] Modern Markov chain Monte Carlo methods have boosted the importance of Bayes' theorem, including cases with improper priors.[11]
Bayesian prediction
- The posterior predictive distribution is the distribution of a new data point, marginalized over the posterior: p(x̃ | X, α) = ∫ p(x̃ | θ) · p(θ | X, α) dθ
- The prior predictive distribution is the distribution of a new data point, marginalized over the prior: p(x̃ | α) = ∫ p(x̃ | θ) · p(θ | α) dθ
Bayesian theory calls for the use of the posterior predictive distribution to do predictive inference, i.e., to predict the distribution of a new, unobserved data point. That is, instead of a fixed point as a prediction, a distribution over possible points is returned. Only this way is the entire posterior distribution of the parameter(s) used. By comparison, prediction in frequentist statistics often involves finding an optimum point estimate of the parameter(s)—e.g., by maximum likelihood or maximum a posteriori estimation (MAP)—and then plugging this estimate into the formula for the distribution of a data point. This has the disadvantage that it does not account for any uncertainty in the value of the parameter, and hence will underestimate the variance of the predictive distribution.
In some instances, frequentist statistics can work around this problem. For example, confidence intervals and prediction intervals in frequentist statistics when constructed from a normal distribution with unknown mean and variance are constructed using a Student's t-distribution. This correctly estimates the variance, due to the facts that (1) the average of normally distributed random variables is also normally distributed, and (2) the predictive distribution of a normally distributed data point with unknown mean and variance, using conjugate or uninformative priors, has a Student's t-distribution. In Bayesian statistics, however, the posterior predictive distribution can always be determined exactly—or at least to an arbitrary level of precision when numerical methods are used.
Both types of predictive distributions have the form of a compound probability distribution (as does the marginal likelihood). In fact, if the prior distribution is a conjugate prior, such that the prior and posterior distributions come from the same family, it can be seen that both prior and posterior predictive distributions also come from the same family of compound distributions. The only difference is that the posterior predictive distribution uses the updated values of the hyperparameters (applying the Bayesian update rules given in the conjugate prior article), while the prior predictive distribution uses the values of the hyperparameters that appear in the prior distribution.
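To make the contrast concrete, here is a minimal sketch under assumed toy data (3 successes in 10 trials, a uniform Beta(1, 1) prior; none of these numbers come from the article), comparing the Beta-Binomial posterior predictive with a plug-in Binomial prediction:

```python
from scipy import stats

# Conjugate Beta-Bernoulli update with invented counts.
a0, b0 = 1.0, 1.0                  # Beta(1, 1) = uniform prior
successes, trials = 3, 10

# Posterior is Beta(a0 + successes, b0 + failures).
a_post, b_post = a0 + successes, b0 + (trials - successes)

# Posterior predictive P(next trial succeeds) = posterior mean of theta.
p_predictive = a_post / (a_post + b_post)   # (1 + 3) / (2 + 10) = 1/3

# Plug-in prediction using the maximum-likelihood point estimate instead.
p_plugin = successes / trials               # 0.3

# For m future trials the posterior predictive is Beta-Binomial; its variance
# exceeds that of the plug-in Binomial(m, p_plugin) prediction.
m = 5
var_predictive = stats.betabinom(m, a_post, b_post).var()
var_plugin = stats.binom(m, p_plugin).var()
print(p_predictive, p_plugin, var_predictive, var_plugin)
```

The larger predictive variance illustrates the point above: plugging in a point estimate ignores parameter uncertainty and understates the spread of future data.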
Mathematical properties
Interpretation of the factor P(E | M) / P(E)
Bayes' theorem can be rewritten as P(M | E) = [P(E | M) / P(E)] · P(M), so the factor P(E | M) / P(E) measures the impact of the evidence E on the belief in the model M. If P(E | M) / P(E) > 1, belief in M increases upon observing E. That is, if the model were true, the evidence would be more likely than is predicted by the current state of belief. The reverse applies for a decrease in belief. If the belief does not change, P(E | M) / P(E) = 1. That is, the evidence is independent of the model. If the model were true, the evidence would be exactly as likely as predicted by the current state of belief.
Cromwell's rule
If P(H) = 0 then P(H | E) = 0. If P(H) = 1 and P(E) > 0, then P(H | E) = 1. This can be interpreted to mean that hard convictions are insensitive to counter-evidence.
The former follows directly from Bayes' theorem. The latter can be derived by applying the first rule to the event "not H" in place of "H", yielding "if P(¬H) = 0, then P(¬H | E) = 0", from which the result immediately follows.
Asymptotic behaviour of posterior
Consider the behaviour of a belief distribution as it is updated a large number of times with independent and identically distributed trials. For sufficiently nice prior probabilities, the Bernstein–von Mises theorem gives that in the limit of infinite trials, the posterior converges to a Gaussian distribution independent of the initial prior, under some conditions first outlined and rigorously proven by Joseph L. Doob in 1948, namely if the random variable in consideration has a finite probability space. More general results were obtained later by the statistician David A. Freedman, who established in two seminal research papers in 1963[12] and 1965[13] when and under what circumstances the asymptotic behaviour of the posterior is guaranteed. His 1963 paper treats, like Doob (1949), the finite case and comes to a satisfactory conclusion. However, if the random variable has an infinite but countable probability space (i.e., corresponding to a die with infinitely many faces), the 1965 paper demonstrates that for a dense subset of priors the Bernstein–von Mises theorem is not applicable. In this case there is almost surely no asymptotic convergence. Later in the 1980s and 1990s Freedman and Persi Diaconis continued to work on the case of infinite countable probability spaces.[14] To summarise, there may be insufficient trials to suppress the effects of the initial choice, and especially for large (but finite) systems the convergence might be very slow.
Conjugate priors
In parameterized form, the prior distribution is often assumed to come from a family of distributions called conjugate priors. The usefulness of a conjugate prior is that the corresponding posterior distribution will be in the same family, and the calculation may be expressed in closed form.
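For concreteness, one standard conjugate pair (a generic textbook example, not one used elsewhere in this article) is the Gamma prior with a Poisson likelihood, in the rate parameterization:

$$
\theta \sim \mathrm{Gamma}(\alpha, \beta), \qquad x_1, \dots, x_n \mid \theta \overset{\text{iid}}{\sim} \mathrm{Poisson}(\theta)
\;\Longrightarrow\;
\theta \mid x_1, \dots, x_n \sim \mathrm{Gamma}\!\left(\alpha + \sum_{i=1}^{n} x_i,\; \beta + n\right).
$$

The posterior stays in the Gamma family, with the hyperparameters updated in closed form by the sufficient statistics Σ x_i and n.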
Estimates of parameters and predictions
It is often desired to use a posterior distribution to estimate a parameter or variable. Several methods of Bayesian estimation select measurements of central tendency from the posterior distribution.
For one-dimensional problems, a unique median exists for practical continuous problems. The posterior median is attractive as a robust estimator.[15]
If there exists a finite mean for the posterior distribution, then the posterior mean is a method of estimation.[16]
Taking a value with the greatest probability defines maximum a posteriori (MAP) estimates:[17]

{θ_MAP} ⊂ arg max_θ p(θ | X, α)

There are examples where no maximum is attained, in which case the set of MAP estimates is empty.
There are other methods of estimation that minimize the posterior risk (expected-posterior loss) with respect to a loss function, and these are of interest to statistical decision theory using the sampling distribution ("frequentist statistics").[18]
The posterior predictive distribution of a new observation x̃ (that is independent of previous observations) is determined by[19]

p(x̃ | X, α) = ∫ p(x̃ | θ) · p(θ | X, α) dθ
Examples
Probability of a hypothesis

| Cookie \ Bowl | Bowl #1 (H1) | Bowl #2 (H2) | Total |
|---|---|---|---|
| Plain, E | 30 | 20 | 50 |
| Choc, ¬E | 10 | 20 | 30 |
| Total | 40 | 40 | 80 |

P(H1 | E) = 30 / 50 = 0.6
Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1?
Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. Let H1 correspond to bowl #1, and H2 to bowl #2. It is given that the bowls are identical from Fred's point of view, thus P(H1) = P(H2), and the two must add up to 1, so both are equal to 0.5. The event E is the observation of a plain cookie. From the contents of the bowls, we know that P(E | H1) = 30/40 = 0.75 and P(E | H2) = 20/40 = 0.5. Bayes' formula then yields

P(H1 | E) = P(E | H1) · P(H1) / [P(E | H1) · P(H1) + P(E | H2) · P(H2)] = (0.75 × 0.5) / (0.75 × 0.5 + 0.5 × 0.5) = 0.6

Before we observed the cookie, the probability we assigned for Fred having chosen bowl #1 was the prior probability, P(H1), which was 0.5. After observing the cookie, we must revise the probability to P(H1 | E), which is 0.6.
Making a prediction
An archaeologist is working at a site thought to be from the medieval period, between the 11th and the 16th century. However, it is uncertain exactly when in this period the site was inhabited. Fragments of pottery are found, some of which are glazed and some of which are decorated. It is expected that if the site were inhabited during the early medieval period, then 1% of the pottery would be glazed and 50% of its area decorated, whereas if it had been inhabited in the late medieval period then 81% would be glazed and 5% of its area decorated. How confident can the archaeologist be in the date of inhabitation as fragments are unearthed?
The degree of belief in the continuous variable C (century) is to be calculated, with the discrete set of events {GD, G¬D, ¬GD, ¬G¬D} (glazed and/or decorated, or neither) as evidence. Assuming linear variation of glaze and decoration with time, and that these variables are independent,

P(E_G | C = c) = 0.01 + 0.16 · (c − 11)
P(E_D | C = c) = 0.5 − 0.09 · (c − 11)

so that, for example, P(E = GD | C = c) = P(E_G | C = c) · P(E_D | C = c).

Assume a uniform prior of f_C(c) = 0.2, and that trials are independent and identically distributed. When a new fragment of type e is discovered, Bayes' theorem is applied to update the degree of belief for each c:

f_C(c | E = e) = P(E = e | C = c) · f_C(c) / ∫ P(E = e | C = c) · f_C(c) dc
A computer simulation of the changing belief as 50 fragments are unearthed is shown on the graph. In the simulation, the site was inhabited around 1420, or c = 15.2. By calculating the area under the relevant portion of the graph for 50 trials, the archaeologist can say that there is practically no chance the site was inhabited in the 11th and 12th centuries, about 1% chance that it was inhabited during the 13th century, 63% chance during the 14th century and 36% during the 15th century. The Bernstein–von Mises theorem asserts here the asymptotic convergence to the "true" distribution because the probability space corresponding to the discrete set of events is finite (see above section on asymptotic behaviour of the posterior).
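A sketch of such a simulation, assuming the linear glaze and decoration probabilities reconstructed above (the grid, seed, and century bookkeeping are illustrative, not the article's original code):

```python
import numpy as np

rng = np.random.default_rng(0)
c_grid = np.linspace(11, 16, 501)   # century as a continuous variable
dc = c_grid[1] - c_grid[0]

def p_glazed(c):
    return 0.01 + 0.16 * (c - 11)

def p_decorated(c):
    return 0.50 - 0.09 * (c - 11)

belief = np.full_like(c_grid, 0.2)  # uniform prior density f_C(c) = 0.2

true_c = 15.2                       # site inhabited around the year 1420
for _ in range(50):
    g = rng.random() < p_glazed(true_c)      # is this fragment glazed?
    d = rng.random() < p_decorated(true_c)   # is it decorated?
    lik = (p_glazed(c_grid) if g else 1 - p_glazed(c_grid)) * \
          (p_decorated(c_grid) if d else 1 - p_decorated(c_grid))
    belief *= lik                   # multiply in this fragment's likelihood
    belief /= belief.sum() * dc     # renormalize so the density integrates to 1

# Posterior probability that the site dates to the 15th century (c in [15, 16)).
mask = (c_grid >= 15) & (c_grid < 16)
print(belief[mask].sum() * dc)
```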
In frequentist statistics and decision theory
A decision-theoretic justification of the use of Bayesian inference was given by Abraham Wald, who proved that every unique Bayesian procedure is admissible. Conversely, every admissible statistical procedure is either a Bayesian procedure or a limit of Bayesian procedures.[20]
Wald characterized admissible procedures as Bayesian procedures (and limits of Bayesian procedures), making the Bayesian formalism a central technique in such areas of frequentist inference as parameter estimation, hypothesis testing, and computing confidence intervals.[21][22][23] For example:
- "Under some conditions, all admissible procedures are either Bayes procedures or limits of Bayes procedures (in various senses). These remarkable results, at least in their original form, are due essentially to Wald. They are useful because the property of being Bayes is easier to analyze than admissibility."[20]
- "In decision theory, a quite general method for proving admissibility consists in exhibiting a procedure as a unique Bayes solution."[24]
- "In the first chapters of this work, prior distributions with finite support and the corresponding Bayes procedures were used to establish some of the main theorems relating to the comparison of experiments. Bayes procedures with respect to more general prior distributions have played a very important role in the development of statistics, including its asymptotic theory." "There are many problems where a glance at posterior distributions, for suitable priors, yields immediately interesting information. Also, this technique can hardly be avoided in sequential analysis."[25]
- "A useful fact is that any Bayes decision rule obtained by taking a proper prior over the whole parameter space must be admissible"[26]
- "An important area of investigation in the development of admissibility ideas has been that of conventional sampling-theory procedures, and many interesting results have been obtained."[27]
Model selection
Bayesian methodology also plays a role in model selection, where the aim is to select one model from a set of competing models that represents most closely the underlying process that generated the observed data. In Bayesian model comparison, the model with the highest posterior probability given the data is selected. The posterior probability of a model depends on the evidence, or marginal likelihood, which reflects the probability that the data is generated by the model, and on the prior belief of the model. When two competing models are a priori considered to be equiprobable, the ratio of their posterior probabilities corresponds to the Bayes factor. Since Bayesian model comparison is aimed at selecting the model with the highest posterior probability, this methodology is also referred to as the maximum a posteriori (MAP) selection rule[28] or the MAP probability rule.[29]
Probabilistic programming
While conceptually simple, Bayesian methods can be mathematically and numerically challenging. Probabilistic programming languages (PPLs) implement functions to easily build Bayesian models together with efficient automatic inference methods. This helps separate the model building from the inference, allowing practitioners to focus on their specific problems and leaving PPLs to handle the computational details for them.[30][31][32]
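As a minimal sketch of this division of labour, the following model uses PyMC (one of several PPLs; the data are invented toy values, and nothing below is specific to this article): the prior and likelihood are declared, and sampling is delegated to the library's automatic MCMC machinery.

```python
import pymc as pm

# Declare the model: a Beta prior on the success probability and a
# Binomial likelihood for 3 observed successes in 10 trials.
with pm.Model():
    theta = pm.Beta("theta", alpha=1, beta=1)      # prior
    pm.Binomial("y", n=10, p=theta, observed=3)    # likelihood
    idata = pm.sample(1000, tune=1000, chains=2)   # automatic MCMC (NUTS)

print(idata.posterior["theta"].mean())             # posterior mean, ~ 1/3
```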
Applications
[edit]Statistical data analysis
See the separate Wikipedia entry on Bayesian statistics, specifically the statistical modeling section of that article.
Computer applications
Bayesian inference has applications in artificial intelligence and expert systems. Bayesian inference techniques have been a fundamental part of computerized pattern recognition techniques since the late 1950s.[33] There is also an ever-growing connection between Bayesian methods and simulation-based Monte Carlo techniques, since complex models cannot be processed in closed form by a Bayesian analysis, while a graphical model structure may allow for efficient simulation algorithms like Gibbs sampling and other Metropolis–Hastings algorithm schemes.[34] Recently[when?] Bayesian inference has gained popularity among the phylogenetics community for these reasons; a number of applications allow many demographic and evolutionary parameters to be estimated simultaneously.
As applied to statistical classification, Bayesian inference has been used to develop algorithms for identifying e-mail spam. Applications which make use of Bayesian inference for spam filtering include CRM114, DSPAM, Bogofilter, SpamAssassin, SpamBayes, Mozilla, XEAMS, and others. Spam classification is treated in more detail in the article on the naïve Bayes classifier.
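A toy sketch of the underlying idea: a naive Bayes spam score with invented word probabilities (real filters estimate these from training corpora), combining evidence in log space so multiplications become additions.

```python
import math

p_spam = 0.5                                   # prior P(spam)
p_word_given_spam = {"free": 0.30, "meeting": 0.01}
p_word_given_ham  = {"free": 0.02, "meeting": 0.10}

def spam_posterior(words):
    # log posterior odds = log prior odds + sum of log likelihood ratios,
    # assuming word occurrences are independent given the class (naive Bayes).
    log_odds = math.log(p_spam / (1 - p_spam))
    for w in words:
        log_odds += math.log(p_word_given_spam[w] / p_word_given_ham[w])
    return 1 / (1 + math.exp(-log_odds))       # back to a probability

print(spam_posterior(["free"]))      # ~ 0.94: "free" pushes toward spam
print(spam_posterior(["meeting"]))   # ~ 0.09: "meeting" pushes toward ham
```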
Solomonoff's inductive inference is the theory of prediction based on observations; for example, predicting the next symbol based upon a given series of symbols. The only assumption is that the environment follows some unknown but computable probability distribution. It is a formal inductive framework that combines two well-studied principles of inductive inference: Bayesian statistics and Occam's razor.[35][unreliable source?] Solomonoff's universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p. Given some p and any computable but unknown probability distribution from which x is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal fashion.[36][37]
Bioinformatics and healthcare applications
Bayesian inference has been applied in different bioinformatics applications, including differential gene expression analysis.[38] Bayesian inference is also used in a general cancer risk model, called CIRI (Continuous Individualized Risk Index), where serial measurements are incorporated to update a Bayesian model which is primarily built from prior knowledge.[39][40]
Cosmology and astrophysical applications
The Bayesian approach has been central to recent progress in cosmology and astrophysics,[41][42] and extends to a wide range of astrophysical problems, including the characterisation of exoplanets (such as fitting the atmosphere of K2-18b[43]), parameter constraints with cosmological data,[44] and calibration in astrophysical experiments.[45]
In cosmology, it is often employed with computational techniques such as Markov chain Monte Carlo (MCMC) and nested sampling algorithms to analyse complex datasets and navigate high-dimensional parameter spaces. A notable application is parameter inference from the Planck 2018 CMB data.[44] The six base cosmological parameters of the Lambda-CDM model are not predicted by a theory, but rather fitted to cosmic microwave background (CMB) data under a chosen model of cosmology (the Lambda-CDM model).[46] The Bayesian cosmology code `cobaya`[47] sets up cosmological runs and interfaces cosmological likelihoods and Boltzmann codes,[48][49] which compute the predicted CMB anisotropies for any given set of cosmological parameters, with an MCMC or nested sampler.
This computational framework is not limited to the standard model; it is also essential for testing alternative or extended theories of cosmology, such as theories with early dark energy[50] or modified gravity theories introducing additional parameters beyond Lambda-CDM. Bayesian model comparison can then be employed to calculate the evidence for competing models, providing a statistical basis for assessing whether the data support them over standard Lambda-CDM.[51]
In the courtroom
Bayesian inference can be used by jurors to coherently accumulate the evidence for and against a defendant, and to see whether, in totality, it meets their personal threshold for "beyond a reasonable doubt".[52][53][54] Bayes' theorem is applied successively to all evidence presented, with the posterior from one stage becoming the prior for the next. The benefit of a Bayesian approach is that it gives the juror an unbiased, rational mechanism for combining evidence. It may be appropriate to explain Bayes' theorem to jurors in odds form, as betting odds are more widely understood than probabilities. Alternatively, a logarithmic approach, replacing multiplication with addition, might be easier for a jury to handle.

If the existence of the crime is not in doubt, only the identity of the culprit, it has been suggested that the prior should be uniform over the qualifying population.[55] For example, if 1,000 people could have committed the crime, the prior probability of guilt would be 1/1000.
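In odds form, each piece of evidence multiplies the current odds by its likelihood ratio; a sketch with a qualifying population of 1,000 and a purely illustrative likelihood ratio of 10,000 for some item of forensic evidence (the ratio is an assumption, not a figure from the source):

$$
\frac{P(G \mid E)}{P(\lnot G \mid E)}
= \frac{P(G)}{P(\lnot G)} \times \frac{P(E \mid G)}{P(E \mid \lnot G)}
= \frac{1}{999} \times 10000 \approx 10,
$$

so that P(G | E) ≈ 10/11 ≈ 0.91. Taking logarithms replaces the successive multiplications with additions, as suggested above.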
The use of Bayes' theorem by jurors is controversial. In the United Kingdom, a defence expert witness explained Bayes' theorem to the jury in R v Adams. The jury convicted, but the case went to appeal on the basis that no means of accumulating evidence had been provided for jurors who did not wish to use Bayes' theorem. The Court of Appeal upheld the conviction, but it also gave the opinion that "To introduce Bayes' Theorem, or any similar method, into a criminal trial plunges the jury into inappropriate and unnecessary realms of theory and complexity, deflecting them from their proper task."
Gardner-Medwin[56] argues that the criterion on which a verdict in a criminal trial should be based is not the probability of guilt, but rather the probability of the evidence, given that the defendant is innocent (akin to a frequentist p-value). He argues that if the posterior probability of guilt is to be computed by Bayes' theorem, the prior probability of guilt must be known. This will depend on the incidence of the crime, which is an unusual piece of evidence to consider in a criminal trial. Consider the following three propositions:
- A – the known facts and testimony could have arisen if the defendant is guilty.
- B – the known facts and testimony could have arisen if the defendant is innocent.
- C – the defendant is guilty.
Gardner-Medwin argues that the jury should believe both A and not-B in order to convict. A and not-B implies the truth of C, but the reverse is not true. It is possible that B and C are both true, but in this case he argues that a jury should acquit, even though they know that they will be letting some guilty people go free. See also Lindley's paradox.
The O.J. Simpson murder trial
The O.J. Simpson trial is frequently cited as a classic example of the misuse of Bayes' theorem in legal reasoning. Defense attorney Alan Dershowitz argued that since only 0.1% of men who abuse their wives go on to murder them, Simpson's history of abuse was statistically irrelevant. This reasoning is flawed because it ignores the crucial fact that Simpson's wife had in fact been murdered. Once this information is incorporated, Bayesian analysis shows that the probability of the abusive husband being guilty rises dramatically, estimated at around 81%, making the abuse history significant evidence rather than background noise.[57]
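A hedged reconstruction of the arithmetic: conditioning on the fact that the wife was murdered, the posterior compares the two ways this could have happened. The 1-in-1,000 abuse-to-murder rate comes from the argument above; the roughly 1-in-4,250 rate of murder by someone other than the partner is an assumed illustrative figure, chosen only to show how a posterior near 81% can arise:

$$
P(\text{husband guilty} \mid \text{abuse}, \text{murder})
= \frac{P(\text{murdered by husband} \mid \text{abuse})}
       {P(\text{murdered by husband} \mid \text{abuse}) + P(\text{murdered by another} \mid \text{abuse})}
\approx \frac{1/1000}{1/1000 + 1/4250} \approx 0.81.
$$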
Too good to be true: When overwhelming evidence fails to convince
This paper[58] uses a Bayesian probabilistic framework to study a paradox: situations where accumulating unanimous evidence, a seemingly definitive indicator, can decrease confidence in a hypothesis when the assumption of independent observations is compromised by even a tiny risk of systemic failure.[58]
Typically, each independent piece of supporting evidence raises confidence in a hypothesis. However, the inclusion of a small probability of systemic error, a failure mode affecting all observations, alters the implications: excessive agreement may actually imply bias rather than truth. Thus, rather than reinforcing belief, too much consistency can generate suspicion and reduced confidence.[58]
Illustrative examples
[edit]Archaeological: The Roman pot
In one scenario, a clay pot's origin is tested (e.g., whether it is from Britain). As repeated tests indicate the same result, confidence initially increases. However, when a small systemic failure rate (e.g., 1% lab contamination) is introduced, the posterior confidence peaks after a few unanimous results and then declines, as continued unanimity increasingly suggests contamination rather than independent confirmation. Eventually, confidence can drop close to 50%, equivalent to a random guess, demonstrating that perfect agreement can erode belief.[58]
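A short sketch of this effect under assumed numbers (individual test accuracy 0.9 and contamination probability 0.01, both invented for illustration; contamination is modelled as making every test report "Britain" regardless of the truth):

```python
import numpy as np

p_correct, eps = 0.9, 0.01
prior_britain = 0.5

ks = np.arange(0, 31)   # number of unanimous "Britain" results
# P(k unanimous results | hypothesis), mixing over the contamination event:
lik_britain     = (1 - eps) * p_correct ** ks       + eps
lik_not_britain = (1 - eps) * (1 - p_correct) ** ks + eps

posterior = (lik_britain * prior_britain) / (
    lik_britain * prior_britain + lik_not_britain * (1 - prior_britain))

# Rises, peaks (around k = 3 or 4 here), then decays back toward 0.5.
print(posterior.round(3))
```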
Legal: Ancient Jewish law (Talmudic example)
The paper references an ancient judicial principle: in Jewish law, a defendant cannot be convicted of a capital crime if all judges unanimously find guilt. This rule, while seemingly counter-intuitive, reflects a recognition of the paradox: absolute agreement may signal systemic failure rather than reliability.[58]
This aligns with the paper's formal analysis of legal scenarios such as witness line-ups, where even a 1% bias causes confidence to decline after approximately three unanimous identifications; further unanimous affirmations make conviction less credible, not more.[58]
The Talmudic rule against unanimous guilty verdicts now finds a formal justification: perfect agreement may be less convincing than moderate, independent agreement.[58]
Cryptography
In cryptographic systems relying on repeated tests (e.g., Rabin–Miller primality testing), hardware or software faults can affect every repetition of a test in the same way. The study shows that ignoring such systemic failure can lead to vast underestimation of false-negative rates, by factors as large as 2^80.[58]
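A hedged order-of-magnitude sketch, assuming a simple two-state failure model in which a systemic fault of probability ε makes every round pass; the per-round 1/4 bound is the standard Miller–Rabin guarantee:

$$
P(\text{composite passes all } t \text{ rounds}) \;\le\; (1 - \varepsilon)\,4^{-t} + \varepsilon .
$$

With t = 40 rounds the nominal bound is 4^{-40} = 2^{-80}, but any non-negligible ε dominates it: for example, ε = 10^{-6} ≈ 2^{-20} makes the true failure rate roughly 2^{60} times the nominal one.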
Bayesian epistemology
Bayesian epistemology is a movement that advocates for Bayesian inference as a means of justifying the rules of inductive logic.
Karl Popper and David Miller have rejected the idea of Bayesian rationalism, i.e. using Bayes' rule to make epistemological inferences:[59] It is prone to the same vicious circle as any other justificationist epistemology, because it presupposes what it attempts to justify. According to this view, a rational interpretation of Bayesian inference would see it merely as a probabilistic version of falsification, rejecting the belief, commonly held by Bayesians, that high likelihood achieved by a series of Bayesian updates would prove the hypothesis beyond any reasonable doubt, or even with likelihood greater than 0.
Other
- The scientific method is sometimes interpreted as an application of Bayesian inference. In this view, Bayes' rule guides (or should guide) the updating of probabilities about hypotheses conditional on new observations or experiments.[60] Bayesian inference has also been applied to treat stochastic scheduling problems with incomplete information by Cai et al. (2009).[61]
- Bayesian search theory is used to search for lost objects.
- Bayesian inference in phylogeny
- Bayesian tool for methylation analysis
- Bayesian approaches to brain function investigate the brain as a Bayesian mechanism.
- Bayesian inference in ecological studies[62][63]
- Bayesian inference is used to estimate parameters in stochastic chemical kinetic models[64]
- Bayesian inference in econophysics for currency or prediction of trend changes in financial quotations[65]
- Bayesian inference in marketing
- Bayesian inference in motor learning
- Bayesian inference is used in probabilistic numerics to solve numerical problems
Bayes and Bayesian inference
The problem considered by Bayes in Proposition 9 of his essay, "An Essay Towards Solving a Problem in the Doctrine of Chances", is the posterior distribution for the parameter a (the success rate) of the binomial distribution.[citation needed]
History
The term Bayesian refers to Thomas Bayes (1701–1761), who proved that probabilistic limits could be placed on an unknown event.[66] However, it was Pierre-Simon Laplace (1749–1827) who introduced (as Principle VI) what is now called Bayes' theorem and used it to address problems in celestial mechanics, medical statistics, reliability, and jurisprudence.[67] Early Bayesian inference, which used uniform priors following Laplace's principle of insufficient reason, was called "inverse probability" (because it infers backwards from observations to parameters, or from effects to causes[68]). After the 1920s, "inverse probability" was largely supplanted by a collection of methods that came to be called frequentist statistics.[68]
In the 20th century, the ideas of Laplace were further developed in two different directions, giving rise to objective and subjective currents in Bayesian practice. In the objective or "non-informative" current, the statistical analysis depends on only the model assumed, the data analyzed,[69] and the method assigning the prior, which differs from one objective Bayesian practitioner to another. In the subjective or "informative" current, the specification of the prior depends on the belief (that is, propositions on which the analysis is prepared to act), which can summarize information from experts, previous studies, etc.
In the 1980s, there was a dramatic growth in research and applications of Bayesian methods, mostly attributed to the discovery of Markov chain Monte Carlo methods, which removed many of the computational problems, and an increasing interest in nonstandard, complex applications.[70] Despite growth of Bayesian research, most undergraduate teaching is still based on frequentist statistics.[71] Nonetheless, Bayesian methods are widely accepted and used, such as for example in the field of machine learning.[72]
See also
References
[edit]Citations
[edit]- ^ "Bayesian". Merriam-Webster.com Dictionary. Merriam-Webster.
- ^ Hacking, Ian (December 1967). "Slightly More Realistic Personal Probability". Philosophy of Science. 34 (4): 316. doi:10.1086/288169. S2CID 14344339.
- ^ "Bayes' Theorem (Stanford Encyclopedia of Philosophy)". Plato.stanford.edu. Retrieved 2014-01-05.
- ^ van Fraassen, B. (1989) Laws and Symmetry, Oxford University Press. ISBN 0-19-824860-1.
- ^ Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Dunson, David B.; Vehtari, Aki; Rubin, Donald B. (2013). Bayesian Data Analysis, Third Edition. Chapman and Hall/CRC. ISBN 978-1-4398-4095-5.
- ^ de Carvalho, Miguel; Page, Garritt; Barney, Bradley (2019). "On the geometry of Bayesian inference" (PDF). Bayesian Analysis. 14 (4): 1013‒1036. doi:10.1214/18-BA1112. S2CID 88521802.
- ^ Lee, Se Yoon (2021). "Gibbs sampler and coordinate ascent variational inference: A set-theoretical review". Communications in Statistics – Theory and Methods. 51 (6): 1549–1568. arXiv:2008.01006. doi:10.1080/03610926.2021.1921214. S2CID 220935477.
- ^ Kolmogorov, A.N. (1933) [1956]. Foundations of the Theory of Probability. Chelsea Publishing Company.
- ^ Tjur, Tue (1980). Probability based on Radon measures. Internet Archive. Chichester [Eng.]; New York : Wiley. ISBN 978-0-471-27824-5.
- ^ Taraldsen, Gunnar; Tufto, Jarle; Lindqvist, Bo H. (2021-07-24). "Improper priors and improper posteriors". Scandinavian Journal of Statistics. 49 (3): 969–991. doi:10.1111/sjos.12550. hdl:11250/2984409. ISSN 0303-6898. S2CID 237736986.
- ^ Robert, Christian P.; Casella, George (2004). Monte Carlo Statistical Methods. Springer. ISBN 978-1-4757-4145-2. OCLC 1159112760.
- ^ Freedman, DA (1963). "On the asymptotic behavior of Bayes' estimates in the discrete case". The Annals of Mathematical Statistics. 34 (4): 1386–1403. doi:10.1214/aoms/1177703871. JSTOR 2238346.
- ^ Freedman, DA (1965). "On the asymptotic behavior of Bayes estimates in the discrete case II". The Annals of Mathematical Statistics. 36 (2): 454–456. doi:10.1214/aoms/1177700155. JSTOR 2238150.
- ^ Robins, James; Wasserman, Larry (2000). "Conditioning, likelihood, and coherence: A review of some foundational concepts". Journal of the American Statistical Association. 95 (452): 1340–1346. doi:10.1080/01621459.2000.10474344. S2CID 120767108.
- ^ Sen, Pranab K.; Keating, J. P.; Mason, R. L. (1993). Pitman's measure of closeness: A comparison of statistical estimators. Philadelphia: SIAM.
- ^ Choudhuri, Nidhan; Ghosal, Subhashis; Roy, Anindya (2005-01-01). "Bayesian Methods for Function Estimation". Handbook of Statistics. Bayesian Thinking. Vol. 25. pp. 373–414. CiteSeerX 10.1.1.324.3052. doi:10.1016/s0169-7161(05)25013-7. ISBN 978-0-444-51539-1.
- ^ "Maximum A Posteriori (MAP) Estimation". www.probabilitycourse.com. Retrieved 2017-06-02.
- ^ Yu, Angela. "Introduction to Bayesian Decision Theory" (PDF). cogsci.ucsd.edu/. Archived from the original (PDF) on 2013-02-28.
- ^ Hitchcock, David. "Posterior Predictive Distribution Stat Slide" (PDF). stat.sc.edu.
- ^ a b Bickel & Doksum (2001, p. 32)
- ^ Kiefer, J.; Schwartz R. (1965). "Admissible Bayes Character of T2-, R2-, and Other Fully Invariant Tests for Multivariate Normal Problems". Annals of Mathematical Statistics. 36 (3): 747–770. doi:10.1214/aoms/1177700051.
- ^ Schwartz, R. (1969). "Invariant Proper Bayes Tests for Exponential Families". Annals of Mathematical Statistics. 40: 270–283. doi:10.1214/aoms/1177697822.
- ^ Hwang, J. T. & Casella, George (1982). "Minimax Confidence Sets for the Mean of a Multivariate Normal Distribution" (PDF). Annals of Statistics. 10 (3): 868–881. doi:10.1214/aos/1176345877.
- ^ Lehmann, Erich (1986). Testing Statistical Hypotheses (Second ed.). (See p. 309 of Chapter 6.7 "Admissibility", and pp. 17–18 of Chapter 1.8 "Complete Classes".)
- ^ Le Cam, Lucien (1986). Asymptotic Methods in Statistical Decision Theory. Springer-Verlag. ISBN 978-0-387-96307-5. (From "Chapter 12 Posterior Distributions and Bayes Solutions", p. 324)
- ^ Cox, D. R.; Hinkley, D.V. (1974). Theoretical Statistics. Chapman and Hall. p. 432. ISBN 978-0-04-121537-3.
- ^ Cox, D. R.; Hinkley, D.V. (1974). Theoretical Statistics. Chapman and Hall. p. 433. ISBN 978-0-04-121537-3.
- ^ Stoica, P.; Selen, Y. (2004). "A review of information criterion rules". IEEE Signal Processing Magazine. 21 (4): 36–47. doi:10.1109/MSP.2004.1311138. S2CID 17338979.
- ^ Fatermans, J.; Van Aert, S.; den Dekker, A.J. (2019). "The maximum a posteriori probability rule for atom column detection from HAADF STEM images". Ultramicroscopy. 201: 81–91. arXiv:1902.05809. doi:10.1016/j.ultramic.2019.02.003. PMID 30991277. S2CID 104419861.
- ^ Bessiere, P., Mazer, E., Ahuactzin, J. M., & Mekhnacha, K. (2013). Bayesian Programming (1 edition) Chapman and Hall/CRC.
- ^ Daniel Roy (2015). "Probabilistic Programming". probabilistic-programming.org. Archived from the original on 2016-01-10. Retrieved 2020-01-02.
- ^ Ghahramani, Z (2015). "Probabilistic machine learning and artificial intelligence". Nature. 521 (7553): 452–459. Bibcode:2015Natur.521..452G. doi:10.1038/nature14541. PMID 26017444. S2CID 216356.
- ^ Fienberg, Stephen E. (2006-03-01). "When did Bayesian inference become "Bayesian"?". Bayesian Analysis. 1 (1). doi:10.1214/06-BA101.
- ^ Jim Albert (2009). Bayesian Computation with R, Second edition. New York, Dordrecht, etc.: Springer. ISBN 978-0-387-92297-3.
- ^ Rathmanner, Samuel; Hutter, Marcus; Ormerod, Thomas C (2011). "A Philosophical Treatise of Universal Induction". Entropy. 13 (6): 1076–1136. arXiv:1105.5721. Bibcode:2011Entrp..13.1076R. doi:10.3390/e13061076. S2CID 2499910.
- ^ Hutter, Marcus; He, Yang-Hui; Ormerod, Thomas C (2007). "On Universal Prediction and Bayesian Confirmation". Theoretical Computer Science. 384 (2007): 33–48. arXiv:0709.1516. Bibcode:2007arXiv0709.1516H. doi:10.1016/j.tcs.2007.05.016. S2CID 1500830.
- ^ Gács, Peter; Vitányi, Paul M. B. (2 December 2010). "Raymond J. Solomonoff 1926-2009". CiteSeerX 10.1.1.186.8268.
- ^ Robinson, Mark D.; McCarthy, Davis J.; Smyth, Gordon K. "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data". Bioinformatics.
- ^ "CIRI". ciri.stanford.edu. Retrieved 2019-08-11.
- ^ Kurtz, David M.; Esfahani, Mohammad S.; Scherer, Florian; Soo, Joanne; Jin, Michael C.; Liu, Chih Long; Newman, Aaron M.; Dührsen, Ulrich; Hüttmann, Andreas (2019-07-25). "Dynamic Risk Profiling Using Serial Tumor Biomarkers for Personalized Outcome Prediction". Cell. 178 (3): 699–713.e19. doi:10.1016/j.cell.2019.06.011. ISSN 1097-4172. PMC 7380118. PMID 31280963.
- ^ Trotta, Roberto (2017). "Bayesian Methods in Cosmology". arXiv:1701.01467 [astro-ph.CO].
- ^ Staicova, Denitsa (2025). "Modern Bayesian Sampling Methods for Cosmological Inference: A Comparative Study". Universe. 11 (2): 68. arXiv:2501.06022. Bibcode:2025Univ...11...68S. doi:10.3390/universe11020068.
- ^ Madhusudhan, Nikku; Constantinou, Savvas; Holmberg, Måns; Sarkar, Subhajit; Piette, Anjali A. A.; Moses, Julianne I. (2025). "New Constraints on DMS and DMDS in the Atmosphere of K2-18 b from JWST MIRI". The Astrophysical Journal. 983 (2): L40. arXiv:2504.12267. Bibcode:2025ApJ...983L..40M. doi:10.3847/2041-8213/adc1c8.
- ^ Aghanim, N.; et al. (2020). "Planck 2018 results". Astronomy & Astrophysics. 641: A6. arXiv:1807.06209. Bibcode:2020A&A...641A...6P. doi:10.1051/0004-6361/201833910.
- ^ Anstey, Dominic; De Lera Acedo, Eloy; Handley, Will (2021). "A general Bayesian framework for foreground modelling and chromaticity correction for global 21 cm experiments". Monthly Notices of the Royal Astronomical Society. 506 (2): 2041–2058. arXiv:2010.09644. doi:10.1093/mnras/stab1765.
- ^ Lewis, Antony; Bridle, Sarah (2002). "Cosmological parameters from CMB and other data: A Monte Carlo approach". Physical Review D. 66 (10) 103511. arXiv:astro-ph/0205436. Bibcode:2002PhRvD..66j3511L. doi:10.1103/PhysRevD.66.103511.
- ^ "Cobaya, a code for Bayesian analysis in Cosmology — cobaya 3.5.7 documentation". cobaya.readthedocs.io. Retrieved 2025-07-23.
- ^ "CAMB — Code for Anisotropies in the Microwave Background (CAMB) 1.6.1 documentation". camb.readthedocs.io. Retrieved 2025-07-23.
- ^ Lesgourgues, Julien (2011). "The Cosmic Linear Anisotropy Solving System (CLASS) I: Overview". arXiv:1104.2932 [astro-ph.IM].
- ^ Hill, J. Colin; McDonough, Evan; Toomey, Michael W.; Alexander, Stephon (2020). "Early dark energy does not restore cosmological concordance". Physical Review D. 102 (4) 043507. arXiv:2003.07355. Bibcode:2020PhRvD.102d3507H. doi:10.1103/PhysRevD.102.043507.
- ^ Trotta, Roberto (2008). "Bayes in the sky: Bayesian inference and model selection in cosmology". Contemporary Physics. 49 (2): 71–104. arXiv:0803.4089. Bibcode:2008ConPh..49...71T. doi:10.1080/00107510802066753.
- ^ Dawid, A. P. and Mortera, J. (1996) "Coherent Analysis of Forensic Identification Evidence". Journal of the Royal Statistical Society, Series B, 58, 425–443.
- ^ Foreman, L. A.; Smith, A. F. M., and Evett, I. W. (1997). "Bayesian analysis of deoxyribonucleic acid profiling data in forensic identification applications (with discussion)". Journal of the Royal Statistical Society, Series A, 160, 429–469.
- ^ Robertson, B. and Vignaux, G. A. (1995) Interpreting Evidence: Evaluating Forensic Science in the Courtroom. John Wiley and Sons. Chichester. ISBN 978-0-471-96026-3.
- ^ Dawid, A. P. (2001) Bayes' Theorem and Weighing Evidence by Juries. Archived 2015-07-01 at the Wayback Machine
- ^ Gardner-Medwin, A. (2005) "What Probability Should the Jury Address?". Significance, 2 (1), March 2005.
- ^ "Bayes' Formula". pi.math.cornell.edu. Retrieved 2025-09-10.
- ^ Gunn, Lachlan J.; Chapeau-Blondeau, François; McDonnell, Mark D.; Davis, Bruce R.; Allison, Andrew; Abbott, Derek (March 2016). "Too good to be true: when overwhelming evidence fails to convince". Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences. 472 (2187) 20150748. arXiv:1601.00900. Bibcode:2016RSPSA.47250748G. doi:10.1098/rspa.2015.0748. PMC 4841483. PMID 27118917.
- ^ Miller, David (1994). Critical Rationalism. Chicago: Open Court. ISBN 978-0-8126-9197-9.
- ^ Howson & Urbach (2005), Jaynes (2003)
- ^ Cai, X.Q.; Wu, X.Y.; Zhou, X. (2009). "Stochastic scheduling subject to breakdown-repeat breakdowns with incomplete information". Operations Research. 57 (5): 1236–1249. doi:10.1287/opre.1080.0660.
- ^ Ogle, Kiona; Tucker, Colin; Cable, Jessica M. (2014-01-01). "Beyond simple linear mixing models: process-based isotope partitioning of ecological processes". Ecological Applications. 24 (1): 181–195. Bibcode:2014EcoAp..24..181O. doi:10.1890/1051-0761-24.1.181. ISSN 1939-5582. PMID 24640543.
- ^ Evaristo, Jaivime; McDonnell, Jeffrey J.; Scholl, Martha A.; Bruijnzeel, L. Adrian; Chun, Kwok P. (2016-01-01). "Insights into plant water uptake from xylem-water isotope measurements in two tropical catchments with contrasting moisture conditions". Hydrological Processes. 30 (18): 3210–3227. Bibcode:2016HyPr...30.3210E. doi:10.1002/hyp.10841. ISSN 1099-1085. S2CID 131588159.
- ^ Gupta, Ankur; Rawlings, James B. (April 2014). "Comparison of Parameter Estimation Methods in Stochastic Chemical Kinetic Models: Examples in Systems Biology". AIChE Journal. 60 (4): 1253–1268. Bibcode:2014AIChE..60.1253G. doi:10.1002/aic.14409. ISSN 0001-1541. PMC 4946376. PMID 27429455.
- ^ Schütz, N.; Holschneider, M. (2011). "Detection of trend changes in time series using Bayesian inference". Physical Review E. 84 (2) 021120. arXiv:1104.3448. Bibcode:2011PhRvE..84b1120S. doi:10.1103/PhysRevE.84.021120. PMID 21928962. S2CID 11460968.
- ^ Stigler, Stephen (1982). "Thomas Bayes's Bayesian Inference". Journal of the Royal Statistical Society. 145 (2): 250–58. doi:10.2307/2981538. JSTOR 2981538.
- ^ Stigler, Stephen M. (1986). "Chapter 3". The History of Statistics. Harvard University Press. ISBN 978-0-674-40340-6.
- ^ Fienberg, Stephen E. (2006). "When did Bayesian Inference Become 'Bayesian'?". Bayesian Analysis. 1 (1): 1–40 [p. 5]. doi:10.1214/06-ba101.
- ^ Bernardo, José-Miguel (2005). "Reference analysis". Handbook of statistics. Vol. 25. pp. 17–90.
- ^ Wolpert, R. L. (2004). "A Conversation with James O. Berger". Statistical Science. 19 (1): 205–218. CiteSeerX 10.1.1.71.6112. doi:10.1214/088342304000000053. MR 2082155. S2CID 120094454.
- ^ Bernardo, José M. (2006). "A Bayesian mathematical statistics primer" (PDF). Icots-7.
- ^ Bishop, C. M. (2007). Pattern Recognition and Machine Learning. New York: Springer. ISBN 978-0-387-31073-2.
Sources
- Aster, Richard; Borchers, Brian, and Thurber, Clifford (2012). Parameter Estimation and Inverse Problems, Second Edition, Elsevier. ISBN 0123850487, ISBN 978-0123850485
- Bickel, Peter J. & Doksum, Kjell A. (2001). Mathematical Statistics, Volume 1: Basic and Selected Topics (Second (updated printing 2007) ed.). Pearson Prentice–Hall. ISBN 978-0-13-850363-5.
- Box, G. E. P. and Tiao, G. C. (1973). Bayesian Inference in Statistical Analysis, Wiley, ISBN 0-471-57428-7
- Edwards, Ward (1968). "Conservatism in Human Information Processing". In Kleinmuntz, B. (ed.). Formal Representation of Human Judgment. Wiley.
- Edwards, Ward (1982). Daniel Kahneman; Paul Slovic; Amos Tversky (eds.). "Judgment under uncertainty: Heuristics and biases". Science. 185 (4157): 1124–1131. Bibcode:1974Sci...185.1124T. doi:10.1126/science.185.4157.1124. PMID 17835457. S2CID 143452957.
Chapter: Conservatism in Human Information Processing (excerpted)
- Jaynes E. T. (2003) Probability Theory: The Logic of Science, CUP. ISBN 978-0-521-59271-0 (Link to Fragmentary Edition of March 1996).
- Howson, C. & Urbach, P. (2005). Scientific Reasoning: the Bayesian Approach (3rd ed.). Open Court Publishing Company. ISBN 978-0-8126-9578-6.
- Phillips, L. D.; Edwards, Ward (October 2008). "Chapter 6: Conservatism in a Simple Probability Inference Task (Journal of Experimental Psychology (1966) 72: 346-354)". In Jie W. Weiss; David J. Weiss (eds.). A Science of Decision Making:The Legacy of Ward Edwards. Oxford University Press. p. 536. ISBN 978-0-19-532298-9.
Further reading
- For a full report on the history of Bayesian statistics and the debates with frequentist approaches, read Vallverdu, Jordi (2016). Bayesians Versus Frequentists: A Philosophical Debate on Statistical Reasoning. New York: Springer. ISBN 978-3-662-48638-2.
- Clayton, Aubrey (August 2021). Bernoulli's Fallacy: Statistical Illogic and the Crisis of Modern Science. Columbia University Press. ISBN 978-0-231-55335-3.
Elementary
The following books are listed in ascending order of probabilistic sophistication:
- Stone, JV (2013). Bayes' Rule: A Tutorial Introduction to Bayesian Analysis. Sebtel Press, England.
- Dennis V. Lindley (2013). Understanding Uncertainty, Revised Edition (2nd ed.). John Wiley. ISBN 978-1-118-65012-7.
- Colin Howson & Peter Urbach (2005). Scientific Reasoning: The Bayesian Approach (3rd ed.). Open Court Publishing Company. ISBN 978-0-8126-9578-6.
- Berry, Donald A. (1996). Statistics: A Bayesian Perspective. Duxbury. ISBN 978-0-534-23476-8.
- Morris H. DeGroot & Mark J. Schervish (2002). Probability and Statistics (third ed.). Addison-Wesley. ISBN 978-0-201-52488-8.
- Bolstad, William M. (2007) Introduction to Bayesian Statistics: Second Edition, John Wiley ISBN 0-471-27020-2
- Winkler, Robert L (2003). Introduction to Bayesian Inference and Decision (2nd ed.). Probabilistic. ISBN 978-0-9647938-4-2. Updated classic textbook. Bayesian theory clearly presented.
- Lee, Peter M. Bayesian Statistics: An Introduction. Fourth Edition (2012), John Wiley ISBN 978-1-1183-3257-3
- Carlin, Bradley P. & Louis, Thomas A. (2008). Bayesian Methods for Data Analysis, Third Edition. Boca Raton, FL: Chapman and Hall/CRC. ISBN 978-1-58488-697-6.
- Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Dunson, David B.; Vehtari, Aki; Rubin, Donald B. (2013). Bayesian Data Analysis, Third Edition. Chapman and Hall/CRC. ISBN 978-1-4398-4095-5.
Intermediate or advanced
- Berger, James O (1985). Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics (Second ed.). Springer-Verlag. Bibcode:1985sdtb.book.....B. ISBN 978-0-387-96098-2.
- Bernardo, José M.; Smith, Adrian F. M. (1994). Bayesian Theory. Wiley.
- DeGroot, Morris H., Optimal Statistical Decisions. Wiley Classics Library. 2004. (Originally published (1970) by McGraw-Hill.) ISBN 0-471-68029-X.
- Schervish, Mark J. (1995). Theory of statistics. Springer-Verlag. ISBN 978-0-387-94546-0.
- Jaynes, E. T. (1998). Probability Theory: The Logic of Science.
- O'Hagan, A. and Forster, J. (2003). Kendall's Advanced Theory of Statistics, Volume 2B: Bayesian Inference. Arnold, New York. ISBN 0-340-52922-9.
- Robert, Christian P (2007). The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation (paperback ed.). Springer. ISBN 978-0-387-71598-8.
- Pearl, Judea. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, San Mateo, CA: Morgan Kaufmann.
- Pierre Bessière et al. (2013). "Bayesian Programming". CRC Press. ISBN 9781439880326
- Francisco J. Samaniego (2010). "A Comparison of the Bayesian and Frequentist Approaches to Estimation". Springer. New York, ISBN 978-1-4419-5940-9
External links
[edit]- "Bayesian approach to statistical problems", Encyclopedia of Mathematics, EMS Press, 2001 [1994]
- Bayesian Statistics from Scholarpedia.
- Introduction to Bayesian probability from Queen Mary University of London
- Mathematical Notes on Bayesian Statistics and Markov Chain Monte Carlo
- Bayesian reading list Archived 2011-06-25 at the Wayback Machine, categorized and annotated by Tom Griffiths
- A. Hajek and S. Hartmann: Bayesian Epistemology, in: J. Dancy et al. (eds.), A Companion to Epistemology. Oxford: Blackwell 2010, 93–106.
- S. Hartmann and J. Sprenger: Bayesian Epistemology, in: S. Bernecker and D. Pritchard (eds.), Routledge Companion to Epistemology. London: Routledge 2010, 609–620.
- Stanford Encyclopedia of Philosophy: "Inductive Logic"
- Bayesian Confirmation Theory (PDF)
- What is Bayesian Learning?
- Data, Uncertainty and Inference — Informal introduction with many examples, ebook (PDF) freely available at causaScientia
Bayesian inference
View on GrokipediaFundamentals
Bayes' Theorem
Bayes' theorem is a fundamental result in probability theory that describes how to update the probability of a hypothesis based on new evidence. It is derived from the basic definition of conditional probability. The conditional probability of event given event (with ) is defined as the ratio of the joint probability to the marginal probability : Similarly, the reverse conditional probability is assuming . Equating the two expressions for the joint probability yields , and solving for gives This is Bayes' theorem, where in the denominator is the marginal probability of , often computed as over a partition of events .[4] In terms of inference, Bayes' theorem formalizes the process of updating the probability of a hypothesis in light of evidence , yielding the posterior probability as proportional to the product of the prior probability and the likelihood , normalized by the total probability of the evidence . This framework enables the revision of initial beliefs about causes or states based on observed effects or data.[5] A useful verbal interpretation of the theorem uses odds ratios. The posterior odds in favor of hypothesis over alternative given evidence are the prior odds multiplied by the likelihood ratio , which quantifies how much more (or less) likely the evidence is under than under . If the likelihood ratio exceeds 1, the evidence strengthens support for ; if below 1, it weakens it.[6] The theorem is named after Thomas Bayes (c. 1701–1761), an English mathematician and Presbyterian minister, who formulated it in an essay likely written in the late 1740s but published posthumously in 1763 as "An Essay Towards Solving a Problem in the Doctrine of Chances" in the Philosophical Transactions of the Royal Society, edited by his colleague Richard Price. Independently, the French mathematician Pierre-Simon Laplace rediscovered the result around 1774 and developed its applications in inverse probability, with his 1812 treatise giving it wider prominence before Bayes's name was retroactively attached by R. A. Fisher in 1950.[7]Prior, Likelihood, and Posterior
Prior, Likelihood, and Posterior
In Bayesian inference, the prior distribution encodes the initial beliefs or knowledge about the unknown parameters $\theta$ before any data are observed. It is a probability distribution $p(\theta)$ assigned to the parameter space, which can incorporate expert opinion, historical data, or theoretical considerations. Subjective priors reflect the personal degrees of belief of the analyst, as emphasized in the subjectivist interpretation of probability, where probabilities are coherent previsions that avoid Dutch books. Objective priors, on the other hand, aim to be minimally informative and free from subjective input, such as uniform priors over a bounded parameter space or the Jeffreys prior, which is derived from the Fisher information matrix to ensure invariance under reparameterization.

The likelihood function quantifies the probability of observing the data $y$ given a specific value of the parameters $\theta$, denoted as $p(y \mid \theta)$. It arises from the probabilistic model of the data-generating process and is typically specified based on the assumed sampling distribution, such as a normal or binomial likelihood depending on the nature of the data. Unlike in frequentist statistics, where the likelihood is used to estimate point values of $\theta$, in Bayesian inference it serves to update the prior by weighting parameter values according to how well they explain the observed data.

The posterior distribution represents the updated beliefs about the parameters after incorporating the data, given by Bayes' theorem as $p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta)$. This proportionality holds because the full expression includes a normalizing constant, the marginal likelihood $p(y) = \int p(y \mid \theta)\,p(\theta)\,d\theta$, which integrates over all possible parameter values to ensure the posterior is a valid probability distribution. The marginal likelihood, also known as the evidence or model probability, plays a crucial role in comparing different models, as it measures the overall predictive adequacy of the model without conditioning on specific parameters.
Updating Beliefs
In Bayesian inference, the process of updating beliefs begins with a prior distribution that encodes an agent's initial state of knowledge or subjective beliefs about an uncertain parameter or hypothesis. As new evidence in the form of observed data arrives, this prior is systematically revised to produce a posterior distribution that integrates the information from the data, weighted by its likelihood under different possible values of the parameter. This dynamic revision reflects a coherent approach to learning, where beliefs evolve rationally in response to empirical evidence, allowing for the quantification and propagation of uncertainty throughout the inference process.[8]

The mathematical basis for this updating is Bayes' theorem, which formalizes the combination of prior beliefs and data evidence into updated posteriors. An insightful reformulation expresses the process in terms of odds ratios: the posterior odds in favor of one hypothesis over another equal the prior odds multiplied by the Bayes factor, a quantity that captures solely the evidential impact of the data by comparing the likelihoods under the competing hypotheses. This odds-based view, pioneered by Harold Jeffreys, separates the roles of initial beliefs and data-driven evidence, facilitating the assessment of how strongly observations support or refute particular models.[9]

While Bayesian updating relies on probabilistic priors and likelihoods, alternative frameworks offer contrasting approaches to belief revision. Logical probability methods, as developed by Rudolf Carnap, derive degrees of confirmation from the structural similarities between evidence and hypotheses using purely logical principles, eschewing subjective priors in favor of objective inductive rules. In a different vein, the Dempster–Shafer theory extends beyond additive probabilities by employing belief functions that distribute mass over subsets of hypotheses, enabling the representation of both uncertainty and ignorance without committing to precise point probabilities; this allows for more flexible combination of evidence sources compared to strict Bayesian conditioning. These alternatives highlight limitations in Bayesian methods, such as sensitivity to prior specification, but often sacrifice the full coherence and normalization properties of probability.[10]

A fundamental heuristic for effective Bayesian updating is Cromwell's rule, which cautions against assigning prior probabilities of exactly zero to logically possible events or one to logically impossible ones, as such extremes can immunize beliefs against contradictory evidence—for example, a zero prior ensures the posterior remains zero irrespective of data strength. Articulated by Dennis Lindley and inspired by Oliver Cromwell's plea to "think it possible you may be mistaken," this rule promotes priors that remain responsive to information, fostering robust inference even under incomplete initial knowledge.[11]
Bayesian Updating
Single Observation
In Bayesian inference, updating the belief about a parameter $\theta$ upon observing a single data point $y$ follows directly from Bayes' theorem, yielding the posterior distribution $p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta)$, where $p(\theta)$ denotes the prior distribution and $p(y \mid \theta)$ the likelihood function. The symbol $\propto$ indicates proportionality, as the posterior is the unnormalized product of the likelihood and prior; to obtain the proper probability distribution, it must be scaled by the marginal likelihood (or evidence) $p(y) = \int p(y \mid \theta)\,p(\theta)\,d\theta$ for continuous $\theta$, ensuring the posterior integrates to 1.[12]

This framework is particularly straightforward when $\theta$ represents discrete hypotheses that are mutually exclusive and exhaustive, such as a finite set $H_1, \dots, H_K$. In this case, the posterior probability for each hypothesis is
$$P(H_i \mid y) = \frac{P(y \mid H_i)\,P(H_i)}{\sum_{j=1}^{K} P(y \mid H_j)\,P(H_j)},$$
where the denominator serves as the normalizing constant, explicitly computable as the sum of the joint probabilities over all hypotheses.[13] For simple cases with few hypotheses, such as binary outcomes (e.g., two competing explanations), this normalization is direct: if the prior odds are $P(H_1)/P(H_2)$ and the likelihood ratio is $P(y \mid H_1)/P(y \mid H_2)$, the posterior odds become their product, with the marginal following as $P(y) = P(y \mid H_1)\,P(H_1) + P(y \mid H_2)\,P(H_2)$.[14]

To illustrate, consider updating the prior probability of rain tomorrow (0.1) based on a single weather reading, such as a cloudy morning, where the likelihood of clouds given rain is 0.8 and the marginal probability of clouds is 0.4; the posterior probability of rain then shifts upward to $0.8 \times 0.1 / 0.4 = 0.2$ to reflect this evidence, computed via the discrete formula above.[15] Such single-observation updates form the foundation for incorporating additional data through repeated application of Bayes' theorem.
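The rain illustration reduces to a single line of arithmetic; a minimal sketch of the computation (values as given above):

```python
prior_rain = 0.1            # P(rain)
p_clouds_given_rain = 0.8   # P(clouds | rain), the likelihood
p_clouds = 0.4              # P(clouds), the marginal probability of the evidence

posterior_rain = p_clouds_given_rain * prior_rain / p_clouds
print(posterior_rain)  # 0.2
```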
Multiple Observations
In Bayesian inference, the framework for incorporating multiple observations extends the single-observation case by combining evidence from several data points to update the prior distribution on the parameter $\theta$. For independent and identically distributed (i.i.d.) observations $y_1, \dots, y_n$, the posterior distribution is given by
$$p(\theta \mid y_1, \dots, y_n) \propto p(\theta) \prod_{i=1}^{n} p(y_i \mid \theta),$$
where the likelihood term factors into a product due to the i.i.d. assumption, reflecting how each observation contributes multiplicatively to the evidence for $\theta$.[16] This formulation scales the single-observation update, where the posterior is proportional to the prior times one likelihood, to a batch of data, enabling efficient incorporation of accumulated evidence.[17]

The i.i.d. assumption—that the observations are independent conditional on $\theta$—simplifies the joint likelihood to the product form, making analytical or computational inference tractable in many models, such as those from the exponential family.[16] This conditional independence is a modeling choice, often justified by the data-generating process, but it can be relaxed when observations exhibit dependence; in such cases, the full joint likelihood is used instead of the product, which may require specifying covariance structures or hierarchical models to capture correlations.[17] For example, in time-series data, autoregressive components can model temporal dependence while still applying Bayes' theorem to the joint distribution.[16]

The marginal likelihood for the multiple observations, which normalizes the posterior, is
$$p(y_1, \dots, y_n) = \int p(\theta) \prod_{i=1}^{n} p(y_i \mid \theta)\, d\theta$$
under the i.i.d. assumption, representing the predictive probability of the data averaged over the prior.[16] This integral, also known as the evidence, plays a key role in model selection via Bayes factors but can be challenging to compute exactly, often approximated using simulation methods like Markov chain Monte Carlo.[17]

When accumulating data from multiple sources or repeated experiments, the batch posterior formula allows direct computation using the full product of likelihoods and the initial prior, avoiding the need to iteratively re-derive intermediate posteriors for subsets of the data.[16] This approach is particularly advantageous in large datasets, where the evidence from all observations is combined proportionally without stepwise adjustments, preserving the coherence of belief updating while scaling to practical applications in fields like epidemiology or machine learning.[17]
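As an illustration of the product-likelihood update (a sketch with made-up data, assuming a normal model with known standard deviation), the posterior can be approximated on a grid:

```python
import numpy as np
from scipy.stats import norm

y = np.array([0.8, 1.3, 0.2, 1.1, 0.9])   # hypothetical i.i.d. observations

theta = np.linspace(-3, 4, 2001)           # grid over the unknown mean
prior = norm.pdf(theta, loc=0, scale=2)    # N(0, 2^2) prior
log_lik = norm.logpdf(y[:, None], loc=theta, scale=1).sum(axis=0)  # i.i.d. product

post = prior * np.exp(log_lik - log_lik.max())   # unnormalized posterior
post /= np.trapz(post, theta)                    # normalize on the grid
print(np.trapz(theta * post, theta))             # posterior mean
```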
Sequential Updating
Sequential updating in Bayesian inference involves iteratively refining the posterior distribution as new observations arrive over time, enabling a dynamic incorporation of evidence. The core mechanism is the recursive application of Bayes' theorem, where the posterior at time $t$, $p(\theta \mid y_{1:t})$, is proportional to the likelihood of the new observation $y_t$ given the parameter $\theta$, multiplied by the posterior from the previous step, $p(\theta \mid y_{1:t-1})$. Formally,
$$p(\theta \mid y_{1:t}) \propto p(y_t \mid \theta)\, p(\theta \mid y_{1:t-1}),$$
assuming the observations are conditionally independent given $\theta$. This form treats the previous posterior as the prior for the current update, allowing beliefs to evolve incrementally without recomputing from the initial prior each time.[16] For independent and identically distributed observations, this sequential process yields the same result as a single batch update using all data at once.[18]

The advantages of this recursive approach are pronounced in online learning environments, where data streams continuously and computational efficiency is paramount, as it avoids the need to store or reprocess the entire dataset. It supports real-time decision-making by providing updated inferences after each new datum, which is essential for adaptive algorithms that respond to evolving information. Additionally, sequential updating is well-suited to dynamic models, where parameters or states change over time, facilitating the tracking of temporal variations through successive refinements of the probability distribution. These benefits have been demonstrated in large-scale data applications, such as cognitive modeling with high-velocity datasets, where incremental updates preserve inferential accuracy while managing resource constraints.[19]

A conceptual example arises in time series filtering, where sequential updating estimates latent states underlying observed data, such as inferring a system's hidden trajectory from noisy sequential measurements. At each time step, the current posterior—representing beliefs about the state—serves as the prior, which is then updated with the new observation's likelihood to produce a sharper estimate, progressively reducing uncertainty as more evidence accumulates. This process mirrors belief revision in sequential data contexts, emphasizing how each update builds on prior knowledge to form a coherent evolving picture.[20]

Despite these strengths, sequential updating presents challenges, particularly in eliciting an appropriate initial prior for long sequences of observations. The choice of starting prior can influence early updates disproportionately if data is sparse initially, and even as subsequent data dominates, misspecification may introduce subtle biases that propagate through the chain. Careful expert elicitation is thus crucial to ensure the prior reflects genuine uncertainty without unduly skewing long-term posteriors, a process that requires structured methods to aggregate domain knowledge reliably.[21]
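A short sketch (reusing the toy normal-mean model from the previous example, with made-up data) confirms that sequential updating reproduces the batch posterior for i.i.d. observations:

```python
import numpy as np
from scipy.stats import norm

y = np.array([0.8, 1.3, 0.2, 1.1, 0.9])
theta = np.linspace(-3, 4, 2001)

# Batch: prior times the full product likelihood, normalized once
batch = norm.pdf(theta, 0, 2) * np.exp(norm.logpdf(y[:, None], theta, 1).sum(axis=0))
batch /= np.trapz(batch, theta)

# Sequential: yesterday's posterior becomes today's prior
seq = norm.pdf(theta, 0, 2)
for yt in y:
    seq *= norm.pdf(yt, theta, 1)
    seq /= np.trapz(seq, theta)

print(np.allclose(batch, seq))  # True: the two routes agree
```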
Formal Framework
Definitions and Notation
In the Bayesian framework for parametric statistical models, the unknown parameters are elements $\theta$ of a parameter space $\Theta$, typically a subset of $\mathbb{R}^p$ for some dimension $p$, while the observed data consist of realizations $x$ from an observable space $\mathcal{X}$, which may be discrete, continuous, or mixed.[22] The prior distribution encodes initial uncertainty about $\theta$ via a probability measure $\pi$ on $\Theta$, which in the continuous case is specified by a density $\pi(\theta)$ with respect to a dominating measure (such as Lebesgue measure), and in the discrete case by a probability mass function.[22] The likelihood function is the conditional probability measure of $x$ given $\theta$, denoted $f(x \mid \theta)$, which serves as the density or mass function of the sampling distribution $x \sim f(\cdot \mid \theta)$.[22]

Distinctions between densities and probabilities arise depending on the nature of the spaces: for continuous $\mathcal{X}$ and $\Theta$, $\pi(\theta)$ and $f(x \mid \theta)$ are probability density functions, integrating to 1 over their respective spaces, whereas for discrete cases they are probability mass functions summing to 1.[22] In scenarios involving point masses, such as degenerate priors or discrete components in mixed distributions, the Dirac delta function $\delta_\tau(\theta)$ represents a unit point mass at a specific value $\tau \in \Theta$, defined such that for any function $g$ continuous at $\tau$, $\int g(\theta)\,\delta_\tau(\theta)\,d\theta = g(\tau)$.[23]

The posterior distribution $\pi(\theta \mid x)$ then combines the prior and likelihood to reflect updated beliefs about $\theta$ after observing $x$, with Bayes' theorem providing the linkage in the form $\pi(\theta \mid x) \propto f(x \mid \theta)\,\pi(\theta)$.[22] This general setup underpins Bayesian inference in parametric models, where $\Theta$ parameterizes the family of distributions $\{f(\cdot \mid \theta) : \theta \in \Theta\}$.[22]
Posterior Distribution
In Bayesian inference, the posterior distribution represents the updated state of knowledge about the unknown parameters $\theta$ after observing the data $x$, synthesizing prior beliefs with the evidence provided by the likelihood. This distribution, denoted $\pi(\theta \mid x)$, quantifies the relative plausibility of different values of $\theta$ conditional on $x$, serving as the foundation for all parameter-focused inferences such as estimating $\theta$ or assessing its uncertainty.[16]

The posterior is formally derived from Bayes' theorem, which states that the joint density of $\theta$ and $x$ factors as $f(x \mid \theta)\,\pi(\theta)$, where $f(x \mid \theta)$ is the likelihood function and $\pi(\theta)$ is the prior distribution. The posterior then follows as the conditional density
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \theta')\,\pi(\theta')\,d\theta'},$$
with the marginal likelihood in the denominator acting as the normalizing constant to ensure $\pi(\theta \mid x)$ integrates to 1 over $\Theta$. This update rule, originally proposed by Thomas Bayes, proportionally weights the prior by the likelihood and normalizes to produce a proper probability distribution.[24][16]

Bayesian posteriors can be parametric or non-parametric, differing in the dimensionality and flexibility of the parameter space. Parametric posteriors assume $\theta$ lies in a finite-dimensional space, constraining the form of the distribution (e.g., a normal likelihood with unknown mean yielding a normal posterior under a normal prior), which facilitates computation but may impose overly rigid assumptions on the data-generating process. In contrast, non-parametric posteriors operate over infinite-dimensional spaces, such as distributions indexed by functions or measures (e.g., via Dirichlet process priors), enabling adaptive modeling of complex, unspecified structures while maintaining coherent uncertainty quantification.[25]

The posterior's role in inference centers on its use to draw conclusions about $\theta$ given $x$, such as computing expectations for point summaries or integrating over it for decision-making under uncertainty, thereby providing a complete probabilistic framework for parameter estimation and hypothesis evaluation.[16]
Predictive Distribution
In Bayesian inference, the predictive distribution for new, unobserved data $\tilde{y}$ given observed data $y$ is obtained by integrating the likelihood of the new data over the posterior distribution of the parameters $\theta$. This is known as the posterior predictive distribution, formally expressed as
$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta,$$
where $p(\tilde{y} \mid \theta)$ is the sampling distribution (likelihood) for the new data and $p(\theta \mid y)$ is the posterior density of the parameters.[16] This formulation marginalizes over the uncertainty in $\theta$, providing a full probabilistic description of future observations that accounts for both data variability and parameter estimation error.

The computation of the posterior predictive distribution involves marginalization, which integrates out the parameters from the joint density $p(\tilde{y}, \theta \mid y)$. In practice, this integral is rarely tractable analytically and is typically approximated using simulation methods, such as drawing samples $\theta^{(s)}$ from the posterior and then generating replicated data $\tilde{y}^{(s)} \sim p(\tilde{y} \mid \theta^{(s)})$ for $s = 1, \dots, S$, yielding an empirical approximation to the distribution.[16] These simulations enable the estimation of predictive quantities like means, variances, or quantiles directly from the sample of $\tilde{y}^{(s)}$.

Unlike frequentist plug-in predictions, which substitute a point estimate $\hat{\theta}$ (e.g., the maximum likelihood estimate) for $\theta$ into the likelihood to obtain a predictive distribution $p(\tilde{y} \mid \hat{\theta})$, the Bayesian posterior predictive averages over the entire posterior, incorporating parameter uncertainty and potentially prior information. This leads to wider predictive intervals in small samples and better calibration for forecasting, as the plug-in approach underestimates variability by treating $\theta$ as fixed.[16]

The posterior predictive distribution is central to forecasting new data in applications such as election outcomes or environmental modeling, where it generates probabilistic predictions by propagating posterior uncertainty forward.[16] It also facilitates model checking through posterior predictive checks, which compare observed data to simulated replicates from the posterior predictive to assess fit, such as by evaluating discrepancies via test statistics like means or extremes.
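In sample-based terms, the marginalization can be carried out by simulation. A minimal sketch, assuming a beta posterior for a binomial success probability (the numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

a, b = 5, 7                                   # assumed Beta(a, b) posterior for theta
theta_draws = rng.beta(a, b, size=10_000)     # draws from p(theta | y)
y_rep = rng.binomial(n=20, p=theta_draws)     # one future batch of 20 trials per draw

# Predictive mean and a 95% predictive interval for the new count
print(y_rep.mean(), np.percentile(y_rep, [2.5, 97.5]))
```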
Mathematical Properties
Marginalization and Conditioning
In Bayesian inference, marginalization is the process of obtaining the probability distribution of a subset of variables by integrating out the others from their joint distribution, effectively accounting for uncertainty in those excluded variables. This operation is essential for focusing on quantities of interest while treating others as nuisance parameters. For instance, the marginal likelihood, also known as the evidence, for observed data $y$ under a model parameterized by $\theta$ is given by
$$p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta,$$
where $p(y \mid \theta)$ is the sampling distribution or likelihood of the data given the parameters, and $p(\theta)$ is the prior distribution on $\theta$. This integral represents the predictive probability of the data under the prior model and serves as a normalizing constant in Bayes' theorem.

The law of total probability provides the foundational justification for marginalization in the Bayesian context, stating that the unconditional density of a variable is the expected value of its conditional density with respect to the marginal density of the conditioning variables. In continuous form this is exactly the evidence integral above, and it extends naturally to discrete cases via summation.[26] By performing marginalization, Bayesian analyses can reduce the dimensionality of high-dimensional parameter spaces, making inference more tractable and interpretable without losing the uncertainty encoded in the integrated variables.

Conditioning complements marginalization by restricting probabilities to scenarios consistent with observed evidence or specified conditions, thereby updating beliefs about remaining uncertainties. In Bayesian inference, conditioning on data $y$ transforms the prior into the posterior via
$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)},$$
where the denominator is the marginalized evidence. This operation can also apply to subsets of data or auxiliary parameters, allowing for targeted updates that incorporate partial information. Together, marginalization and conditioning enable the decomposition of complex joint distributions into manageable components, facilitating dimensionality reduction and precise probabilistic reasoning in Bayesian models.
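As a numeric illustration of the evidence integral (a sketch, assuming binomial data with a Beta(2, 2) prior), the marginal likelihood can be computed by quadrature and checked against the beta-binomial closed form:

```python
import numpy as np
from scipy.stats import binom, beta, betabinom

n_trials, y_obs = 10, 7
theta = np.linspace(0, 1, 10_001)

# p(y) = integral of p(y | theta) p(theta) d(theta)
integrand = binom.pmf(y_obs, n_trials, theta) * beta.pdf(theta, 2, 2)
evidence = np.trapz(integrand, theta)

print(evidence, betabinom.pmf(y_obs, n_trials, 2, 2))  # essentially equal
```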
Conjugate Priors
In Bayesian inference, a conjugate prior is defined as a family of prior probability distributions for which the posterior distribution belongs to the same family after updating with data from a specified likelihood function. This property ensures that the posterior can be obtained by simply updating the parameters of the prior, without requiring changes in the distributional form. The concept is particularly useful for distributions in the exponential family, where conjugate priors can be constructed to match the sufficient statistics of the likelihood.[27]

A classic example is the Beta-Binomial model, where the parameter $\theta$ of a Binomial likelihood represents the success probability. The prior is taken as $\theta \sim \mathrm{Beta}(\alpha, \beta)$, with density proportional to $\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}$. For $n$ independent observations yielding $s$ successes, the posterior is $\mathrm{Beta}(\alpha + s, \beta + n - s)$. This update interprets $\alpha$ and $\beta$ as pseudocounts of prior successes and failures, respectively.[28]

Another prominent case is the Normal-Normal conjugate pair, applicable when estimating the mean $\mu$ of a Normal distribution with known variance $\sigma^2$. The prior is $\mu \sim N(\mu_0, \tau_0^2)$. Given $n$ i.i.d. observations with sample mean $\bar{y}$, the posterior is:
$$\mu \mid y \sim N\!\left(\frac{\mu_0/\tau_0^2 + n\bar{y}/\sigma^2}{1/\tau_0^2 + n/\sigma^2},\ \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)^{-1}\right).$$
The posterior mean is a precision-weighted average of the prior mean and sample mean, while the posterior variance is reduced relative to both.[29]

For count data, the Gamma-Poisson model provides conjugacy, with the Poisson rate $\lambda$ having prior $\lambda \sim \mathrm{Gamma}(\alpha, \beta)$, density proportional to $\lambda^{\alpha - 1} e^{-\beta \lambda}$. For $n$ i.i.d. Poisson observations summing to $s$, the posterior is $\mathrm{Gamma}(\alpha + s, \beta + n)$. Here, $\alpha$ and $\beta$ act as prior shape and rate parameters updated by the total counts and exposure time.

The primary advantage of conjugate priors lies in their analytical tractability: posteriors, marginal likelihoods, and predictive distributions can often be derived in closed form, avoiding numerical integration and enabling efficient sequential updating in dynamic models. This is especially beneficial for evidence calculation via marginalization, where the normalizing constant is straightforward to compute. However, conjugate families impose restrictions on the form of prior beliefs, potentially limiting flexibility in capturing complex or data-driven uncertainties, which may require sensitivity analyses to assess robustness.[31]
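The closed-form updates make conjugate models easy to verify in code; a brief sketch of the Beta-Binomial and Normal-Normal cases with hypothetical numbers:

```python
from scipy import stats

# Beta-Binomial: Beta(a, b) prior, s successes in n trials -> Beta(a + s, b + n - s)
a, b, n, s = 1, 1, 20, 14
post = stats.beta(a + s, b + n - s)
print(post.mean(), post.interval(0.95))

# Normal-Normal with known sigma: precision-weighted combination
mu0, tau0, sigma, n_obs, ybar = 0.0, 2.0, 1.0, 25, 0.4
precision = 1 / tau0**2 + n_obs / sigma**2
mu_post = (mu0 / tau0**2 + n_obs * ybar / sigma**2) / precision
print(mu_post, precision**-0.5)  # posterior mean and standard deviation
```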
Asymptotic Behavior
As the sample size $n$ increases, the Bayesian posterior distribution exhibits desirable asymptotic properties under suitable regularity conditions, ensuring that inference becomes increasingly reliable. A fundamental result is the consistency of the posterior, which states that the posterior probability concentrates on the true parameter value $\theta_0$ almost surely with respect to the data-generating measure, provided the model is well-specified and the prior assigns positive mass to neighborhoods of $\theta_0$. This property, first established by Doob, implies that the posterior mean and other summaries converge to $\theta_0$, justifying the use of Bayesian methods for large datasets.

Under additional smoothness and identifiability assumptions, the Bernstein–von Mises theorem provides a more precise characterization: the posterior distribution asymptotically approximates a normal distribution centered at the maximum likelihood estimator $\hat{\theta}_n$, with covariance matrix given by the inverse observed Fisher information, scaled by $1/n$. Specifically, the posterior of the rescaled parameter $\sqrt{n}(\theta - \hat{\theta}_n)$ converges to $N\bigl(0, I(\theta_0)^{-1}\bigr)$ in total variation distance, almost surely. This approximation holds for i.i.d. data from a correctly specified parametric model and priors that are sufficiently smooth and non-degenerate near $\theta_0$, as detailed in standard treatments of asymptotic statistics.

The rate of convergence in the Bernstein–von Mises theorem is typically $O(n^{-1/2})$, reflecting the parametric efficiency of the posterior, which matches the frequentist central limit theorem for the MLE. Asymptotically, the influence of the prior diminishes, with the posterior becoming increasingly dominated by the likelihood; the prior's effect is of higher order, $O(n^{-1})$, ensuring that posterior credible intervals align closely with confidence intervals based on the observed information. This vanishing prior influence underscores the robustness of Bayesian inference to prior choice in large samples.

In cases of model misspecification, where the true data-generating distribution lies outside the assumed model, these asymptotic behaviors adapt accordingly. The posterior remains consistent but concentrates on a pseudo-true parameter $\theta^*$ that minimizes the Kullback–Leibler divergence from the true distribution to the model, rather than the true $\theta_0$. The Bernstein–von Mises approximation still holds, now centered at the MLE converging to $\theta^*$, with the asymptotic normality preserved under local asymptotic normality conditions on the misspecified likelihood. However, the rate may degrade in severely misspecified scenarios, and prior influence can persist if the prior favors regions away from $\theta^*$.[32]
Estimation and Inference
Point Estimates
In Bayesian inference, point estimates provide a single summary value for the parameter of interest, derived from the posterior distribution $p(\theta \mid x)$, where $\theta$ is the parameter and $x$ represents the observed data. These estimates balance prior beliefs with the likelihood of the data, offering a way to condense the full posterior into a practical representative value. The choice of point estimate depends on the decision-theoretic framework, particularly the loss function that quantifies the cost of estimation error.[33]

The posterior mean, also known as the Bayes estimator under squared error loss, is given by
$$\hat{\theta} = \mathbb{E}[\theta \mid x] = \int \theta\, p(\theta \mid x)\, d\theta.$$
This estimate minimizes the expected posterior loss $\mathbb{E}\bigl[(\theta - a)^2 \mid x\bigr]$, making it suitable when errors are symmetrically penalized proportional to their squared magnitude. For instance, in estimating a normal mean with a normal prior, the posterior mean is a weighted average of the prior mean and the sample mean, reflecting the precision of each. The posterior mean is often preferred in applications requiring unbiased summaries under quadratic penalties, as it coincides with the minimum mean squared error estimator in the posterior sense.[33][34]

The posterior median minimizes the expected absolute error loss $\mathbb{E}\bigl[\lvert \theta - a \rvert \mid x\bigr]$ and serves as a robust point estimate, particularly when the posterior is skewed or outliers are a concern. It is defined as the value $m$ such that $P(\theta \le m \mid x) = 1/2$. This property makes the median less sensitive to extreme posterior tails compared to the mean. In contrast, the maximum a posteriori (MAP) estimate, which is the posterior mode $\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} p(\theta \mid x)$, minimizes the 0–1 loss function $L(\theta, a) = 1 - \mathbf{1}(a = \theta)$, where $\mathbf{1}$ is the indicator function; it is ideal for scenarios penalizing any deviation equally, regardless of size, and often aligns with maximizing the posterior density, equivalent to penalized maximum likelihood. The MAP can be computed via optimization techniques and is computationally convenient when the posterior is unimodal.[33][34]

The selection among these estimates hinges on the assumed loss function: squared loss favors the mean for its emphasis on large errors, absolute loss suits the median for robustness, and 0–1 loss highlights the mode for peak posterior probability. Unlike frequentist point estimates, such as the maximum likelihood estimator, which rely solely on the data and exhibit properties like consistency in large samples without priors, Bayesian point estimates incorporate prior information, potentially improving accuracy in small-sample or informative-prior settings but introducing dependence on prior choice.[33]
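Given posterior draws, the three estimators are one-liners; a sketch using a skewed stand-in posterior (gamma draws, not from the source):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
draws = rng.gamma(3.0, 1.0, size=20_000)   # stand-in posterior samples (right-skewed)

post_mean = draws.mean()                   # Bayes estimate under squared error loss
post_median = np.median(draws)             # Bayes estimate under absolute error loss
grid = np.linspace(draws.min(), draws.max(), 2000)
post_mode = grid[gaussian_kde(draws)(grid).argmax()]   # approximate MAP via a KDE

print(post_mean, post_median, post_mode)   # mean > median > mode for this skew
```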
Credible Intervals
In Bayesian inference, a credible interval provides a range for an unknown parameter $\theta$ such that the posterior probability that $\theta$ lies within the interval, given the observed data $x$, equals $1 - \alpha$. Formally, a $100(1 - \alpha)\%$ credible interval $[L(x), U(x)]$ satisfies
$$P\bigl(L(x) \le \theta \le U(x) \mid x\bigr) = 1 - \alpha,$$
where the probability is computed with respect to the posterior distribution $p(\theta \mid x)$. This direct probabilistic statement contrasts with frequentist confidence intervals, which quantify the long-run frequency with which a procedure produces intervals containing the fixed true parameter, without assigning probability to $\theta$ itself given the data.[35]

Two primary types of credible intervals are the equal-tail interval and the highest posterior density (HPD) interval. The equal-tail interval is defined by the central portion of the posterior, specifically the interval between the $\alpha/2$ and $1 - \alpha/2$ quantiles of $p(\theta \mid x)$; it is symmetric in probability mass but may not be the shortest possible interval. In contrast, the HPD interval is the shortest interval achieving the coverage $1 - \alpha$, consisting of the set $\{\theta : p(\theta \mid x) \ge k\}$ where $k$ is chosen such that the integral over this set equals $1 - \alpha$; this makes it particularly suitable for skewed posteriors, as it prioritizes regions of highest density. The equal-tail approach performs well for symmetric unimodal posteriors, where the two types coincide, but the HPD generally offers better efficiency for asymmetric cases.[36]

Computation of credible intervals depends on the posterior form. For models with conjugate priors, where the posterior belongs to a known parametric family (e.g., beta or normal), credible intervals can be obtained analytically using the cumulative distribution function or quantile functions of that family. In non-conjugate or complex cases, numerical methods are required, such as Markov chain Monte Carlo (MCMC) sampling to approximate $p(\theta \mid x)$, followed by quantile estimation for equal-tail intervals or optimization algorithms to find the HPD region. These numerical approaches ensure reliable interval construction even for high-dimensional parameters.
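Both interval types are easy to extract from posterior draws; a sketch with a skewed stand-in posterior, using the usual shortest-window scan over sorted samples for the HPD:

```python
import numpy as np

rng = np.random.default_rng(2)
draws = np.sort(rng.beta(2, 8, size=20_000))   # skewed stand-in posterior samples

eq_tail = np.percentile(draws, [2.5, 97.5])    # equal-tail 95% interval

m = int(0.95 * len(draws))                     # HPD: shortest window holding 95%
widths = draws[m:] - draws[:-m]
lo = widths.argmin()
hpd = (draws[lo], draws[lo + m])

print(eq_tail, hpd)   # the HPD interval is shorter and shifted toward the mode
```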
Hypothesis Testing
In Bayesian hypothesis testing, hypotheses are evaluated through the comparison of posterior probabilities derived from Bayes' theorem, providing a direct measure of relative evidence in favor of competing models or hypotheses. Unlike approaches that rely on long-run frequencies, this framework incorporates prior beliefs and updates them with observed data to assess the plausibility of each hypothesis. A central tool for this purpose is the Bayes factor, which quantifies the relative support for one hypothesis over another based on the data alone.[37]

The Bayes factor (BF) is defined as the ratio of the marginal likelihoods under two competing hypotheses,
$$BF_{10} = \frac{p(y \mid H_1)}{p(y \mid H_0)},$$
where $p(y \mid H_i)$ is the marginal probability of the data under hypothesis $H_i$, obtained by integrating the likelihood over the prior distribution for the parameters under that hypothesis. This ratio arises from the work of Harold Jeffreys, who developed it as a method for objective model comparison in scientific inference.[38] Values of $BF_{10}$ greater than 1 indicate evidence in favor of $H_1$, while values less than 1 support $H_0$; for instance, $BF_{10}$ values between 3 and 10 are often interpreted as substantial evidence according to Jeffreys' scale.[37] The marginal likelihoods can be challenging to compute analytically, particularly for complex models, but approximations such as Laplace's method or numerical integration are commonly employed.[37]

Posterior odds for the hypotheses are then obtained by multiplying the Bayes factor by the prior odds:
$$\frac{P(H_1 \mid y)}{P(H_0 \mid y)} = BF_{10} \times \frac{P(H_1)}{P(H_0)}.$$
This relationship, a direct consequence of Bayes' theorem, allows the incorporation of subjective or objective prior probabilities on the hypotheses themselves, yielding posterior probabilities that can guide decisions.[37] For point null hypotheses, such as $H_0 : \theta = \theta_0$, the posterior odds can be linked to credible intervals by examining the posterior density at the null value, though this is typically a secondary consideration to the Bayes factor approach.[9]

For testing equivalence or practical null hypotheses, where the goal is to determine if a parameter lies within a predefined interval of negligible effect (e.g., no meaningful difference), the region of practical equivalence (ROPE) provides a complementary Bayesian procedure. The ROPE is specified as an interval around the null value, such as $(\theta_0 - \epsilon, \theta_0 + \epsilon)$, reflecting domain-specific notions of practical insignificance. Evidence for the null is declared if a high-density interval (e.g., the 95% highest density interval) of the posterior falls entirely within the ROPE, while evidence against equivalence occurs if the interval lies outside. This method, advocated by John Kruschke, addresses limitations in traditional testing by explicitly quantifying decisions about parameter values rather than point estimates.[39]

Despite these advantages, Bayesian hypothesis testing via Bayes factors and related tools exhibits sensitivity to the choice of priors on model parameters and hypotheses, which can substantially alter the marginal likelihoods and thus the evidential conclusions. This dependence underscores the need for robustness checks, such as varying the priors and reporting the range of resulting Bayes factors, to ensure inferences are not overly influenced by prior specifications.[40]
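For a point null with a conjugate alternative, the Bayes factor has a closed form; a sketch testing a fair coin against a uniform prior on the bias (hypothetical counts):

```python
from scipy.stats import binom, betabinom

n, y = 100, 61
m1 = betabinom.pmf(y, n, 1, 1)   # marginal likelihood under H1: theta ~ Beta(1, 1)
m0 = binom.pmf(y, n, 0.5)        # likelihood under the point null H0: theta = 0.5

bf10 = m1 / m0
print(bf10)                      # ~1.4: only weak evidence for H1

posterior_odds = bf10 * 1.0      # 1:1 prior odds on the hypotheses
print(posterior_odds / (1 + posterior_odds))  # posterior P(H1 | y) ~ 0.58
```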
Examples
Coin Toss Problem
The coin toss problem exemplifies Bayesian inference in a simple discrete setting, where the goal is to estimate the unknown probability $\theta$ of the coin landing heads, assuming independent tosses. This scenario models situations like estimating success probabilities in binary trials, such as defect rates or election outcomes. Observations of heads and tails update an initial belief (prior) about $\theta$ to form a posterior distribution that quantifies updated uncertainty.[16]

The setup begins with a binomial likelihood for the data: given $n$ tosses, the number of heads $y$ follows $y \sim \mathrm{Binomial}(n, \theta)$, with probability mass function $p(y \mid \theta) = \binom{n}{y}\,\theta^{y}(1 - \theta)^{n - y}$. The prior distribution for $\theta$ is chosen as the beta distribution, $\theta \sim \mathrm{Beta}(\alpha, \beta)$, with density $p(\theta) \propto \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}$, where $\alpha$ and $\beta$ act as prior counts of heads and tails, respectively. This choice is convenient because the beta family is conjugate to the binomial, ensuring the posterior remains beta-distributed: $\theta \mid y \sim \mathrm{Beta}(\alpha + y, \beta + n - y)$. The conjugacy of the beta-binomial pair facilitates analytical updates and was systematically developed in early Bayesian decision theory frameworks.[16][41]

The posterior mean provides a point estimate for $\theta$: $\mathbb{E}[\theta \mid y] = \frac{\alpha + y}{\alpha + \beta + n}$, which blends the prior mean $\frac{\alpha}{\alpha + \beta}$ and the maximum likelihood estimate $y/n$, with weights proportional to their effective sample sizes $\alpha + \beta$ and $n$. For uncertainty quantification after $n$ tosses, a 95% credible interval spans the 0.025 and 0.975 quantiles of the posterior beta distribution, which can be obtained via the beta quantile function. As an illustration, consider a uniform prior ($\alpha = \beta = 1$) and data of 437 heads in 980 tosses; the posterior is $\mathrm{Beta}(438, 544)$, with mean approximately 0.446 and 95% credible interval [0.415, 0.477], showing contraction around the data while influenced by the prior.[16]

Visualization of the prior and posterior densities reveals the updating process: the prior beta density starts as a broad curve (e.g., uniform for $\alpha = \beta = 1$), and successive data incorporation shifts the mode toward $y/n$ while reducing variance, as seen in overlaid density plots. For small $n$, the posterior retains substantial prior shape; with large $n$, it approximates a normal density centered at the sample proportion. These plots, often generated using statistical software, underscore the gradual dominance of data over prior beliefs.[16]
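The 437-heads example above can be reproduced directly from the beta quantile function; a minimal sketch:

```python
from scipy.stats import beta

alpha0, beta0 = 1, 1              # uniform Beta(1, 1) prior
heads, n = 437, 980

posterior = beta(alpha0 + heads, beta0 + n - heads)   # Beta(438, 544)
print(posterior.mean())           # ~0.446
print(posterior.interval(0.95))   # ~(0.415, 0.477)
```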
Medical Diagnosis
In medical diagnosis, Bayesian inference enables the calculation of the probability that a patient has a disease after receiving a test result, by combining the disease's prior probability (typically its prevalence in the population) with the test's likelihood properties. Sensitivity, defined as the probability of a positive test given the presence of the disease, and specificity, the probability of a negative test given the absence of the disease, serve as the key likelihood ratios in this updating process. These parameters allow clinicians to compute the posterior probability using Bayes' theorem, which formally is
$$P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + \bigl(1 - P(- \mid \neg D)\bigr)\bigl(1 - P(D)\bigr)},$$
where $D$ denotes the presence of the disease, $+$ a positive test result, and $P(- \mid \neg D)$ the specificity; an analogous formula applies for a negative test result.[42]

A classic illustrative example involves a rare disease with a 1% prevalence ($P(D) = 0.01$) and a diagnostic test exhibiting 99% sensitivity ($P(+ \mid D) = 0.99$) and 99% specificity ($P(- \mid \neg D) = 0.99$). To compute the posteriors, consider a hypothetical cohort of 10,000 individuals screened for the disease. The resulting contingency table breaks down the outcomes as follows, with the posterior computed in the sketch after the table:

|  | Disease Present ($D$) | Disease Absent ($\neg D$) | Total |
|---|---|---|---|
| Positive Test ($+$) | 99 (true positives) | 99 (false positives) | 198 |
| Negative Test ($-$) | 1 (false negative) | 9,801 (true negatives) | 9,802 |
| Total | 100 | 9,900 | 10,000 |
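Reading off the first row of the table, the posterior probability of disease given a positive test is 99/198 = 0.5, despite the test's 99% accuracy; a minimal sketch of the computation:

```python
prev, sens, spec = 0.01, 0.99, 0.99

p_pos = sens * prev + (1 - spec) * (1 - prev)   # marginal P(+) = 0.0198
ppv = sens * prev / p_pos                       # P(D | +) = 0.5
npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
print(ppv, npv)                                 # 0.5 and ~0.9999
```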
Linear Regression
Bayesian linear regression applies Bayesian inference to model the conditional expectation of a response variable $y$ given predictors $X$, assuming a linear relationship with additive Gaussian noise. The model is specified as $y = X\beta + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2 I)$ and $\sigma^2$ is known. This setup allows for exact posterior inference when a conjugate normal prior is used on the regression coefficients $\beta$.[45]

A natural conjugate prior for $\beta$ is the multivariate normal distribution, $\beta \sim N(m_0, \Lambda_0^{-1})$, where $\Lambda_0$ is the prior precision matrix. The likelihood is $y \mid X, \beta \sim N(X\beta, \sigma^2 I)$. The resulting posterior distribution is also multivariate normal, with updated precision $\Lambda_n = \Lambda_0 + \sigma^{-2} X^\top X$ and mean $m_n = \Lambda_n^{-1}\bigl(\Lambda_0 m_0 + \sigma^{-2} X^\top y\bigr)$. This conjugate update combines the prior information with the data evidence in a closed form, enabling straightforward computation of posterior summaries such as the mean and credible intervals for $\beta$.[45]

The predictive distribution for a new response $\tilde{y}$ at covariate values $x_*$ follows from integrating over the posterior,
$$\tilde{y} \mid x_*, y \sim N\bigl(x_*^\top m_n,\ \sigma^2 + x_*^\top \Lambda_n^{-1} x_*\bigr).$$
This distribution quantifies uncertainty in predictions, incorporating both the residual variance and the posterior variability in $\beta$, which widens for extrapolations where $x_*$ lies far from the training data support.[45]

In comparison to ordinary least squares (OLS), which yields the point estimate $\hat{\beta} = (X^\top X)^{-1} X^\top y$, the Bayesian posterior mean acts as a shrinkage estimator that pulls estimates toward the prior mean $m_0$. The strength of shrinkage depends on the prior precision $\Lambda_0$ relative to the data precision $\sigma^{-2} X^\top X$; a weakly informative prior (small $\Lambda_0$) results in $m_n \approx \hat{\beta}$, while stronger priors regularize against overfitting, particularly in low-data regimes. Unlike OLS, which lacks inherent uncertainty quantification for coefficients, the full posterior provides a distribution over possible values.
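A compact sketch of the conjugate update on simulated data (illustrative only, with $\sigma$ treated as known) shows the shrinkage toward the prior mean:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data for y = X @ beta + noise with known sigma
n, sigma = 50, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=sigma, size=n)

m0 = np.zeros(2)            # prior mean
L0 = 0.5 * np.eye(2)        # prior precision matrix

Ln = L0 + (X.T @ X) / sigma**2                            # posterior precision
mn = np.linalg.solve(Ln, L0 @ m0 + X.T @ y / sigma**2)    # posterior mean

ols = np.linalg.solve(X.T @ X, X.T @ y)
print(mn, ols)   # the posterior mean is the OLS fit shrunk toward m0
```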
Comparisons with Frequentist Methods
Key Differences
Bayesian inference fundamentally differs from frequentist statistics in its philosophical foundations and methodological approaches, particularly in how uncertainty is quantified and incorporated into statistical reasoning. At the core of this distinction lies the interpretation of probability: in the Bayesian paradigm, probability represents a degree of belief about unknown parameters, treating them as random variables with distributions that reflect subjective or objective uncertainty.[46] In contrast, the frequentist view regards parameters as fixed but unknown constants, with probability defined as the long-run frequency of events in repeated sampling, applying only to random data generation processes.[16] This epistemological divide shapes all subsequent aspects of inference, emphasizing belief updating in Bayesian methods versus objective sampling properties in frequentist ones.[1]

A primary methodological contrast arises in the process of inference. Bayesian inference derives the posterior distribution of parameters by combining the likelihood of the observed data with a prior distribution, yielding direct probabilistic statements about parameter values, such as the probability that a parameter lies within a certain interval.[46] Frequentist inference, however, relies on the sampling distribution of statistics under fixed parameters, producing measures like p-values or confidence intervals that describe the behavior of estimators over hypothetical repeated samples rather than probabilities for the parameters themselves.[16] For instance, a Bayesian credible interval quantifies the plausible range for a parameter given the data and prior, while a frequentist confidence interval indicates the method's long-run coverage reliability, not a direct probability statement.[1]

The role of prior information further delineates these paradigms. Bayesian methods explicitly incorporate prior distributions to represent pre-existing knowledge or assumptions about parameters, allowing for the subjective integration of expert opinion or historical data into the analysis, which can be updated sequentially as new evidence emerges.[46] Frequentist approaches eschew priors entirely, aiming for objectivity by basing inferences solely on the observed data and likelihood, without accommodating prior beliefs, which proponents argue avoids bias but limits flexibility in incorporating domain knowledge.[16] This inclusion of priors in Bayesian inference is often contentious, as it introduces elements of subjectivity, yet it enables more nuanced modeling in complex scenarios where data alone may be insufficient.[47]

Regarding repeatability and the nature of statistical conclusions, frequentist statistics emphasizes long-run frequency properties, such as the coverage probability of confidence intervals approaching the nominal level over infinite repetitions of the experiment under the true parameter.[46] Bayesian inference, by contrast, focuses on updating beliefs through the posterior, providing a coherent framework for sequential learning where conclusions evolve with accumulating data, without reliance on hypothetical repetitions.[16] This belief-updating mechanism allows Bayesian methods to offer interpretable probabilities for hypotheses directly, fostering a dynamic approach to uncertainty that aligns with inductive reasoning in scientific inquiry.[1]
Model Selection
In Bayesian inference, model selection involves comparing multiple competing models to determine which best explains the observed data, accounting for both fit and complexity. The posterior probability of a model $M_k$ given data $y$ is given by $P(M_k \mid y) \propto P(M_k)\, p(y \mid M_k)$, where $P(M_k)$ is the prior probability of the model and $p(y \mid M_k)$ is the marginal likelihood, also known as the evidence or predictive density of the data under the model.[48] This formulation naturally incorporates prior beliefs about model plausibility and favors models that balance goodness-of-fit with parsimony.[48]

The marginal likelihood is computed as the integral
$$p(y \mid M_k) = \int p(y \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k,$$
integrating out the model parameters $\theta_k$ with respect to their prior distribution.[48] This integral quantifies the average predictive performance of the model across its parameter space, penalizing overly complex models whose prior probability mass is dispersed over a larger volume, thus making it less likely to concentrate on the data under a point null or simple alternative.[48] For comparing two models $M_1$ and $M_2$, the Bayes factor $B_{12} = p(y \mid M_1)/p(y \mid M_2)$ serves as the ratio of their marginal likelihoods, providing a measure of relative evidence; values greater than 1 indicate support for $M_1$.[48]

This approach embodies Occam's razor through the inherent complexity penalty in the marginal likelihood: simpler models assign higher prior density to parameter regions compatible with the data, yielding higher evidence, while complex models dilute this density across implausible regions, reducing their posterior odds unless the data strongly favors the added flexibility.[48] Posterior model probabilities can then be obtained by normalizing over all models, enabling probabilistic statements about model uncertainty, such as the probability that the true model is among a subset.[48]

Computing the marginal likelihood exactly is often intractable for high-dimensional models, leading to approximations like the Bayesian Information Criterion (BIC), which provides an asymptotic estimate: $\mathrm{BIC} = -2 \ln \hat{L} + k \ln n$, where $\hat{L}$ is the maximized likelihood, $k$ is the number of parameters in $M_k$, and $n$ is the sample size; lower BIC values approximate higher log marginal likelihoods.[49][48] Similarly, the Akaike Information Criterion (AIC), $\mathrm{AIC} = -2 \ln \hat{L} + 2k$, can be interpreted in a Bayesian context as a rough approximation to the relative expected Kullback–Leibler divergence, though it applies a milder penalty and is less consistent for model selection in large samples compared to BIC.[50][48] These criteria facilitate practical model comparison by approximating the Bayesian evidence without full integration.[48]
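Both criteria are trivial to compute once a model has been fitted; a sketch with hypothetical maximized log-likelihoods:

```python
import numpy as np

def bic(loglik_max, k, n):
    """BIC = -2 ln L_hat + k ln n (lower is better)."""
    return -2 * loglik_max + k * np.log(n)

def aic(loglik_max, k):
    """AIC = -2 ln L_hat + 2 k (lower is better)."""
    return -2 * loglik_max + 2 * k

# Two hypothetical models fitted to the same n = 200 observations
print(bic(-310.4, 3, 200), bic(-308.9, 5, 200))  # the simpler model wins on BIC
print(aic(-310.4, 3), aic(-308.9, 5))            # AIC penalizes complexity less
```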
Decision Theory Integration
Bayesian decision theory integrates the principles of Bayesian inference with decision-making under uncertainty, providing a framework for selecting actions that minimize expected losses based on posterior beliefs. In this approach, a loss function $L(\theta, a)$ quantifies the penalty for taking action $a$ when $\theta$ is the true parameter value, allowing decisions to be evaluated relative to probabilistic assessments of uncertainty.[33] The posterior expected loss, or posterior risk, for an action $a$ given data $x$ is then computed as
$$\rho(a \mid x) = \int L(\theta, a)\, p(\theta \mid x)\, d\theta,$$
where $p(\theta \mid x)$ is the posterior distribution; the optimal Bayes action minimizes this quantity for each observed $x$.[33] This setup ensures that decisions are coherent with the subjective or objective probabilities encoded in the prior and updated via Bayes' theorem.

The overall performance of a decision rule $\delta$ is assessed through the Bayes risk, which averages the risk function $R(\theta, \delta)$ over the prior distribution: $r(\pi, \delta) = \int R(\theta, \delta)\, \pi(\theta)\, d\theta$. A Bayes rule $\delta^{\pi}$, which minimizes the posterior risk for prior $\pi$, in turn minimizes the Bayes risk among all decision rules, establishing it as the optimal procedure under the chosen prior.[33] For specific loss functions, such as squared error loss $L(\theta, a) = (\theta - a)^2$, the Bayes rule corresponds to the posterior mean as a point estimate, linking decision theory directly to common Bayesian summaries.[33]

Within the Bayesian framework, admissibility requires that no other decision rule has a risk function that is smaller or equal everywhere and strictly smaller for some $\theta$; Bayes rules are generally admissible, particularly under conditions like bounded loss functions or compact parameter spaces, as they achieve the minimal possible risk in a neighborhood of the prior.[33] The minimax criterion, which seeks to minimize the maximum risk $\sup_{\theta} R(\theta, \delta)$, can be attained by Bayes rules when the risk is constant over $\theta$, providing a robust alternative when priors are uncertain. This Bayesian minimax approach contrasts with non-Bayesian versions by incorporating prior information to stabilize decisions.

Bayesian decision theory is fundamentally connected to utility maximization, where the loss function is the negative of a utility function, $L(\theta, a) = -U(\theta, a)$, so that selecting the action maximizing the posterior expected utility yields the same optimal decisions. This linkage, axiomatized in subjective expected utility theory, ensures that rational choices under uncertainty align with coherent probability assessments, as developed in foundational works on personal probability.
Computational Methods
Markov Chain Monte Carlo
Markov chain Monte Carlo (MCMC) methods are essential computational techniques in Bayesian inference for approximating posterior distributions when analytical solutions are intractable, particularly for complex models with high-dimensional parameter spaces. These methods generate a sequence of samples from a Markov chain whose stationary distribution matches the target posterior, allowing estimation of posterior expectations, credible intervals, and other summaries through Monte Carlo integration. By simulating dependent samples that converge to the posterior, MCMC enables inference in scenarios where direct sampling is impossible, such as non-conjugate priors where the posterior lacks a closed form.[51]

The Metropolis–Hastings algorithm, a foundational MCMC method, constructs the Markov chain through a proposal distribution and an acceptance mechanism to ensure the chain targets the desired posterior. At each iteration, a candidate state $\theta'$ is proposed from a distribution $q(\theta' \mid \theta)$, where $\theta$ is the current state. The acceptance probability is then computed as
$$\alpha = \min\!\left(1,\ \frac{\pi(\theta')\, q(\theta \mid \theta')}{\pi(\theta)\, q(\theta' \mid \theta)}\right),$$
where $\pi(\cdot)$ denotes the unnormalized posterior density; the proposal is accepted with probability $\alpha$, otherwise the chain remains at $\theta$. This general framework, introduced by Metropolis et al. in 1953 for symmetric proposals and extended by Hastings in 1970 to arbitrary proposals, guarantees detailed balance and thus convergence to the posterior under mild conditions.[52][53]

Gibbs sampling, a special case of Metropolis–Hastings, simplifies the process for multivariate posteriors by iteratively sampling from full conditional distributions, avoiding explicit acceptance steps. For a parameter vector $\theta = (\theta_1, \dots, \theta_d)$, the algorithm updates each component sequentially or in random order by drawing $\theta_j \sim p(\theta_j \mid \theta_{-j}, y)$, where $\theta_{-j}$ denotes all components except $\theta_j$ and $y$ is the data. This method, originally proposed by Geman and Geman in 1984 for image restoration, exploits conditional independence to explore the posterior efficiently, particularly in hierarchical models where conditionals are tractable despite an intractable joint.[54]

Assessing MCMC convergence is crucial, as chains may mix slowly or fail to explore the posterior adequately. Trace plots visualize sample paths over iterations, revealing trends, autocorrelation, or stationarity; effective sample size, accounting for dependence, quantifies the information content relative to independent draws. The Gelman–Rubin diagnostic compares variability across multiple parallel chains started from overdispersed initial values, estimating the potential scale reduction factor $\hat{R}$, where values near 1 indicate convergence; originally developed by Gelman and Rubin in 1992, it monitors both within- and between-chain variances to detect lack of equilibration.[55]

In high-dimensional Bayesian inference, MCMC excels at handling posteriors with thousands of parameters, such as in genomic models or spatial statistics, by iteratively navigating complex geometries that defy analytical tractability. For instance, in large-scale regression, Metropolis–Hastings with adaptive proposals or Gibbs sampling in conjugate-like blocks scales to dimensions where direct integration fails, providing asymptotically exact approximations whose accuracy improves with chain length. These methods underpin applications in fields requiring uncertainty quantification over vast parameter spaces, though computational cost grows with dimension, motivating efficient implementations.[51]
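A random-walk Metropolis sampler fits in a dozen lines; this sketch targets a standard normal as a stand-in for an unnormalized posterior:

```python
import numpy as np

rng = np.random.default_rng(4)

def log_post(theta):
    """Unnormalized log posterior; standard normal used as a stand-in target."""
    return -0.5 * theta**2

def metropolis(n_iter=10_000, step=1.0):
    theta, draws = 0.0, np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + rng.normal(scale=step)   # symmetric proposal: q terms cancel
        if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
            theta = prop                         # accept; otherwise stay put
        draws[i] = theta
    return draws

draws = metropolis()
print(draws[1000:].mean(), draws[1000:].std())   # ~0 and ~1 after burn-in
```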
Variational Inference
Variational inference is a deterministic optimization-based approach to approximating the intractable posterior distribution in Bayesian models by selecting a simpler variational distribution $q(\theta)$ that minimizes the Kullback–Leibler (KL) divergence to the true posterior $p(\theta \mid y)$.[56] This method transforms the inference problem into an optimization task, making it suitable for large-scale models where exact computation is infeasible.[57] The KL divergence, defined as
$$\mathrm{KL}\bigl(q \,\Vert\, p\bigr) = \int q(\theta)\, \log \frac{q(\theta)}{p(\theta \mid y)}\, d\theta,$$
measures the information loss when using $q$ to approximate $p$, and minimizing it yields a tight approximation when $q$ is flexible enough.[56]

In variational Bayes, the approximation is achieved by maximizing the evidence lower bound (ELBO), which provides a tractable lower bound on the log marginal likelihood $\log p(y)$:
$$\mathrm{ELBO}(q) = \mathbb{E}_q\bigl[\log p(y, \theta)\bigr] - \mathbb{E}_q\bigl[\log q(\theta)\bigr] = \log p(y) - \mathrm{KL}\bigl(q \,\Vert\, p\bigr).$$
This objective decomposes the KL divergence and can be optimized directly, as the marginal likelihood term is constant with respect to $q$.[57] The ELBO balances model fit (via the expected log joint) and regularization (via the entropy of $q$), ensuring the approximation remains close to the prior while explaining the data.[56]

A common choice for $q$ is the mean-field approximation, which assumes full independence among the parameters, factorizing as $q(\theta) = \prod_j q_j(\theta_j)$. This simplifies computations in high-dimensional spaces, such as graphical models, by decoupling the updates for each factor. Optimization often proceeds via coordinate ascent, iteratively maximizing the ELBO with respect to each $q_j$ while holding others fixed, leading to closed-form updates in conjugate exponential family models.[56][57]

Compared to Markov chain Monte Carlo (MCMC) methods, variational inference offers significant speed advantages, scaling to millions of data points through efficient optimization, but it introduces bias due to the restrictive form of $q$, potentially underestimating posterior uncertainty.[57] In practice, this trade-off favors variational methods for real-time applications requiring scalability, while MCMC is preferred when unbiased estimates are critical despite longer computation times.[58]
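As a deliberately crude illustration of the idea (not a practical algorithm), one can fit a normal $q$ to a skewed target by scanning the KL divergence over a grid of candidate means and standard deviations:

```python
import numpy as np
from scipy.stats import beta, norm

theta = np.linspace(1e-4, 1 - 1e-4, 4001)
log_p = beta.logpdf(theta, 8, 4)          # stand-in "posterior": Beta(8, 4)

best = None
for m in np.linspace(0.4, 0.9, 51):
    for s in np.linspace(0.05, 0.3, 51):
        q = norm.pdf(theta, m, s)
        q /= np.trapz(q, theta)           # renormalize on the grid
        kl = np.trapz(q * (np.log(q + 1e-300) - log_p), theta)
        if best is None or kl < best[0]:
            best = (kl, m, s)

print(best)   # (KL, mean, sd): mean near 8/12, sd near the target's 0.13
```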
Probabilistic Programming
Probabilistic programming languages facilitate the specification and inference of Bayesian models by allowing users to define probabilistic structures in code, separating model declaration from inference algorithms. These tools enable statisticians and data scientists to express complex hierarchical models intuitively, often using declarative syntax where the focus is on the joint probability distribution rather than implementation details.[59][60]

JAGS (Just Another Gibbs Sampler), introduced in 2003, exemplifies declarative modeling through a BUGS-like language that represents Bayesian hierarchical models as directed acyclic graphs, specifying nodes' distributions and dependencies.[59] Stan, released in 2012, employs an imperative probabilistic programming approach in its domain-specific language, defining a log probability function over parameters and data with blocks for transformed parameters and generated quantities, offering greater expressiveness for custom computations.[61] PyMC, evolving from its 2015 version to a comprehensive framework by 2023, uses Python-based declarative syntax to build models with distributions like `pm.Normal` and supports hierarchical structures seamlessly.[60] These languages integrate inference engines such as Markov chain Monte Carlo (MCMC) methods—including Gibbs sampling in JAGS and Hamiltonian Monte Carlo in Stan and PyMC—and variational inference (VI) approximations, allowing automatic posterior sampling or optimization without manual coding of samplers.[59][61][60]
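As an illustration of this declarative style (a sketch against the PyMC API as of recent releases; details may vary across versions), the coin-toss model from the Examples section can be written as:

```python
import pymc as pm

with pm.Model():
    theta = pm.Beta("theta", alpha=1, beta=1)            # Beta(1, 1) prior
    y = pm.Binomial("y", n=980, p=theta, observed=437)   # binomial likelihood
    idata = pm.sample(1000, tune=1000, chains=4)         # MCMC via NUTS

print(idata.posterior["theta"].mean())   # ~0.446, matching the conjugate answer
```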
Key benefits of these frameworks include enhanced reproducibility, as models and inference configurations can be version-controlled and shared via code repositories, ensuring identical results with fixed seeds and software versions.[62][61] Automatic differentiation (AD) further accelerates inference; Stan computes exact gradients using reverse-mode AD for efficient MCMC, while PyMC leverages PyTensor for gradient-based VI and sampling.[61][60] JAGS, though lacking native AD, promotes reproducibility through its scripting interface and compatibility with R for transparent analysis pipelines.[59][62]
By 2025, probabilistic programming has evolved to incorporate deep learning integrations, exemplified by Pyro, a PyTorch-based language introduced in 2018 that unifies neural networks with Bayesian modeling for scalable deep probabilistic programs.[63] Pyro supports MCMC and VI engines with automatic differentiation via PyTorch, enabling hybrid models like variational autoencoders within Bayesian frameworks, and its NumPyro extension provides JAX-accelerated inference for large-scale applications.[64] This progression reflects a broader trend toward universal probabilistic programming, bridging traditional Bayesian tools with modern machine learning ecosystems.[62]
Applications
Machine Learning and AI
Bayesian neural networks (BNNs) extend traditional neural networks by placing prior distributions over the weights, enabling the quantification of epistemic uncertainty in predictions. This approach treats the network parameters as random variables, allowing the posterior distribution to capture both data fit and model uncertainty, which is particularly useful in safety-critical applications where overconfidence can be detrimental. The foundational work on BNNs was developed in Radford Neal's 1996 thesis, which demonstrated how Bayesian methods can regularize neural networks and provide principled uncertainty estimates through integration over the posterior. In practice, priors such as Gaussian distributions are commonly imposed on weights to encode assumptions about their magnitude and correlations, leading to more robust models that avoid overfitting compared to maximum likelihood estimation.[65]

Gaussian processes (GPs) serve as a cornerstone of Bayesian machine learning for non-parametric regression and classification tasks, modeling functions as distributions over possible mappings from inputs to outputs. In regression, GPs use a kernel function to define the covariance structure, yielding predictive distributions that naturally incorporate uncertainty, with the mean function providing point estimates and the variance reflecting confidence intervals. For classification, GPs extend this framework via latent function approximations, such as through Laplace methods or variational techniques, to handle binary or multi-class problems while maintaining probabilistic outputs. The seminal formulation of GPs for machine learning was advanced in the 2006 book by Rasmussen and Williams, which established GPs as a flexible alternative to parametric models, especially effective for small-to-medium datasets where exact inference is feasible.[66] GPs excel in scenarios requiring interpretable uncertainty, such as time-series forecasting or spatial interpolation, and their Bayesian nature ensures that predictions update coherently with new data.

Active learning leverages Bayesian methods to select the most informative data points for labeling, reducing the annotation burden in supervised learning pipelines. By querying instances that maximize expected information gain—often measured via mutual information between predictions and model parameters—Bayesian active learning efficiently explores the data space, particularly when integrated with GPs or BNNs as surrogate models. An influential approach, BALD (Bayesian Active Learning by Disagreement), uses the mutual information between predictions and posterior parameters to prioritize queries that resolve parameter uncertainty. This method, building on earlier information-theoretic frameworks, has shown substantial label efficiency gains in image classification tasks.[67] Complementing active learning, Bayesian optimization employs GPs to model objective functions in black-box settings, iteratively selecting points via acquisition functions like expected improvement to balance exploration and exploitation.
Active learning leverages Bayesian methods to select the most informative data points for labeling, reducing the annotation burden in supervised learning pipelines. By querying instances that maximize expected information gain (often measured via the mutual information between predictions and model parameters), Bayesian active learning explores the data space efficiently, particularly when paired with GPs or BNNs as surrogate models. An influential approach, BALD (Bayesian Active Learning by Disagreement), prioritizes queries whose predictions disagree most across the posterior over parameters, thereby resolving parameter uncertainty. Building on earlier information-theoretic frameworks, it has shown substantial label-efficiency gains in image classification tasks.[67] Complementing active learning, Bayesian optimization employs GPs to model objective functions in black-box settings, iteratively selecting evaluation points via acquisition functions such as expected improvement to balance exploration and exploitation. The expected improvement criterion, introduced in Jones et al.'s 1998 work, has become a standard for hyperparameter tuning and experimental design, converging faster than grid search or random sampling in high-dimensional spaces.[68]

In the 2020s, advances have focused on scalable inference for BNNs in large-scale models, addressing the computational cost of posterior approximation through variational inference (VI) and related techniques. VI approximates the posterior by optimizing a lower bound on the evidence, enabling efficient training of BNNs with millions of parameters by amortizing inference across mini-batches. Notable progress includes rank-1 factorizations that shrink the parameter space while preserving uncertainty calibration, as demonstrated in Dusenberry et al.'s 2020 method, which improved scalability on datasets like CIFAR-10 without sacrificing predictive performance.[69] These developments have brought Bayesian principles into deep learning architectures, enhancing reliability in domains such as autonomous systems and natural language processing; the resulting predictive distributions provide calibrated uncertainties that guide decision-making under limited data.
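As a sketch of how the acquisition step works, the expected improvement for a minimization problem can be computed directly from a surrogate's predictive mean and variance (the ξ exploration margin and the reuse of gp_posterior from the previous sketch are illustrative assumptions):

```python
# Expected improvement (minimization form) from GP predictive moments.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, var, best_y, xi=0.01):
    sigma = np.sqrt(np.maximum(var, 1e-12))  # guard against zero variance
    z = (best_y - mu - xi) / sigma
    # Large when the mean beats best_y (exploitation) or when the
    # variance is large (exploration).
    return (best_y - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Usage with the GP sketch above: score candidate points, pick the best.
candidates = np.linspace(-3.0, 3.0, 200)
mu, var = gp_posterior(X, y, candidates)
next_x = candidates[np.argmax(expected_improvement(mu, var, y.min()))]
```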
Bioinformatics and Healthcare
Bayesian inference plays a pivotal role in phylogenetic analysis by incorporating priors on evolutionary trees to estimate relationships among species or sequences from genomic data. In this framework, priors such as the birth-death sampling process model the tree topology and branch lengths, accounting for incomplete sampling and extinction events to produce posterior distributions of phylogenies. Seminal software like MrBayes implements Markov chain Monte Carlo (MCMC) sampling to explore these posteriors under mixed substitution models, enabling robust inference even with sparse data. Similarly, BEAST extends this by integrating time-calibrated trees with molecular clock priors, facilitating divergence time estimation in molecular epidemiology and evolutionary biology. These methods have revolutionized systematics by quantifying uncertainty in tree topologies and supporting hypotheses like adaptive radiations through posterior probabilities.

In drug discovery, Bayesian adaptive trials optimize clinical development by dynamically adjusting enrollment, dosing, or arms based on interim data, incorporating historical priors to improve efficiency and the ethics of patient allocation. For instance, multi-arm multi-stage designs use posterior probabilities to drop ineffective treatments early, reducing sample sizes while maintaining power, as demonstrated in oncology trials where priors from preclinical data inform efficacy thresholds.[70] High-impact applications include the I-SPY 2 trial, which employed Bayesian hierarchical models to predict pathological complete response rates, accelerating the identification of promising therapies for breast cancer subtypes. This approach minimizes exposure to futile regimens and integrates real-time learning, contrasting with fixed frequentist designs by leveraging accumulating evidence for dose escalation or futility stopping.

Genomic data analysis benefits from Bayesian hierarchical models to detect and characterize genetic variants, such as single nucleotide polymorphisms (SNPs), by pooling information across loci or populations to shrink effect estimates and control false positives. These models place hyperpriors on variant effects, enabling variable selection in genome-wide association studies (GWAS) where thousands of markers are tested simultaneously, as in the Bayesian lasso approach that penalizes small effects while highlighting causal variants.[71] For structural variants like copy number variations (CNVs), hierarchical priors model probe-level noise and allelic imbalance, inferring segment boundaries and ploidy states from next-generation sequencing reads with improved resolution over non-Bayesian methods. Such frameworks have identified population-specific selection signals in human genomes, quantifying admixture and linkage disequilibrium through posterior credible intervals.

During the COVID-19 pandemic, Bayesian extensions of the susceptible-infected-recovered (SIR) model incorporated informative priors on transmission rates and reporting biases to forecast epidemics and evaluate interventions across regions. These models used time-varying priors derived from early outbreak data to update effective reproduction numbers (R_t) sequentially, capturing multiple waves and the effects of non-pharmaceutical interventions like lockdowns with spatiotemporal hierarchies.
Influential analyses, such as those integrating changepoint detection, estimated underreporting factors and intervention impacts in the United Kingdom, providing probabilistic forecasts that informed policy with uncertainty bands.[72] Because each update's posterior serves as the prior for the next, these approaches allowed parameters to be refined in real time as new case data arrived, without refitting from scratch.
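The sequential-updating idea can be shown with a deliberately simplified grid approximation (this is a toy, not any published COVID-19 model; the Poisson likelihood, exposure factor, and case counts are illustrative assumptions):

```python
# Toy sequential Bayesian updating of a transmission-rate parameter:
# each day's posterior becomes the prior for the next day's cases.
import numpy as np
from scipy.stats import poisson

rates = np.linspace(0.1, 3.0, 300)          # grid over the unknown rate
prior = np.ones_like(rates) / len(rates)    # flat initial prior

daily_cases = [4, 7, 6, 9, 12]              # toy incoming case counts
for cases in daily_cases:
    likelihood = poisson.pmf(cases, mu=10.0 * rates)  # 10.0 = toy exposure
    posterior = prior * likelihood
    posterior /= posterior.sum()            # normalize over the grid
    prior = posterior                       # no refitting from scratch

print(rates[np.argmax(posterior)])          # posterior mode of the rate
```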
Astrophysics and Cosmology
In astrophysics and cosmology, Bayesian inference plays a central role in parameter estimation for the standard ΛCDM model, particularly through analyses of cosmic microwave background (CMB) data from the Planck satellite. The Planck Collaboration employed Markov chain Monte Carlo (MCMC) methods within a Bayesian framework to derive constraints on cosmological parameters such as the Hubble constant, matter density, and amplitude of scalar perturbations, yielding precise posteriors that confirm the flatness of the universe and the presence of cold dark matter at approximately 26% of the energy density.[73] These inferences integrate likelihoods from temperature and polarization anisotropies, incorporating priors informed by previous missions like WMAP, to quantify uncertainties and tensions such as the Hubble constant discrepancy.[73]

Bayesian model comparison has been instrumental in evaluating hypotheses about dark matter, such as comparing cold dark matter (CDM) profiles against cored or warm dark matter (WDM) alternatives using dwarf galaxy data. For instance, analyses of Milky Way satellites like Fornax and Sculptor applied Bayesian evidence calculations to assess Navarro-Frenk-White (NFW) cuspy profiles versus Burkert cored models, finding strong preference for cored profiles in some systems because the Occam penalty favors simpler fits to rotation curves and stellar kinematics.[74] In broader cosmological contexts, such comparisons extend to WDM models constrained by Lyman-alpha forest data, where Bayesian evidence disfavors pure WDM over CDM but allows mixed scenarios to alleviate small-scale structure issues.

Hierarchical Bayesian modeling enhances inference from large galaxy surveys by accounting for population-level variations and selection effects. In surveys like the Baryon Oscillation Spectroscopic Survey (BOSS), hierarchical approaches model galaxy clustering and redshift-space distortions, treating individual galaxy redshifts as draws from a shared cosmological power spectrum while marginalizing over astrophysical nuisance parameters like bias. This framework propagates uncertainties through the hierarchy, enabling robust constraints on parameters like the growth rate of structure, and has been adapted for forward modeling in upcoming surveys such as DESI to forecast dark energy properties.

Recent advancements leverage Bayesian methods for James Webb Space Telescope (JWST) data, enabling inference on high-redshift galaxy properties and early universe cosmology. Post-2022 analyses of JWST's NIRCam and MIRI observations use simulation-based Bayesian inference to fit spectral energy distributions of galaxies at z > 10, constraining star formation histories and escape fractions while incorporating JWST-specific systematics like point-spread function variations.[75] These efforts challenge ΛCDM by probing reionization-era feedback, with hierarchical models integrating JWST photometry to infer global parameters like the ionizing photon budget.

In gravitational wave astronomy, Bayesian inference underpins signal detection and parameter estimation by the LIGO and Virgo collaborations. For events like GW150914, nested sampling algorithms compute posteriors on source masses, spins, and sky locations by comparing waveform models against detector noise, achieving sub-percent precision on chirp masses through marginalization over calibration errors.
Hierarchical extensions further infer population properties, such as merger rates, from multiple detections, informing astrophysical models of binary black hole formation. As datasets grow, asymptotic approximations facilitate efficient inference on large-scale gravitational wave catalogs.
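The MCMC machinery underlying these parameter-estimation pipelines can be illustrated in miniature. Below is a minimal random-walk Metropolis sampler for a toy one-parameter problem; the synthetic data, flat prior, and known noise level are all illustrative assumptions, and production analyses use far richer likelihoods and samplers such as NUTS or nested sampling:

```python
# Toy random-walk Metropolis sampler for one parameter h.
import numpy as np

rng = np.random.default_rng(0)
data = 70.0 + 2.0 * rng.standard_normal(50)  # synthetic measurements
sigma = 2.0                                  # assumed-known noise level

def log_posterior(h):
    # Flat prior, so the log posterior is the Gaussian log-likelihood
    # up to an additive constant.
    return -0.5 * np.sum((data - h) ** 2) / sigma**2

samples, h = [], 60.0                        # arbitrary starting point
for _ in range(5000):
    proposal = h + 0.5 * rng.standard_normal()  # random-walk proposal
    if np.log(rng.random()) < log_posterior(proposal) - log_posterior(h):
        h = proposal                         # accept; otherwise keep h
    samples.append(h)

posterior = np.array(samples[1000:])         # discard burn-in
print(posterior.mean(), posterior.std())     # posterior summary for h
```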
Philosophical and Historical Context
Bayesian Epistemology
Bayesian epistemology posits that rational degrees of belief, or credences, must satisfy the axioms of probability to ensure coherence among one's opinions. This coherence requirement demands that credences be non-negative, sum to one over complementary propositions, and be additive for disjoint events, thereby avoiding internal inconsistencies in belief structures.[76] Probabilism, as this norm is known, serves as a foundational constraint, dictating that beliefs ought to cohere probabilistically in order to avoid irrationality.[76]

Dutch book arguments provide a pragmatic justification for treating subjective probabilities as coherent degrees of belief, demonstrating that violations of the probability axioms expose an agent to guaranteed losses in fair betting scenarios. A Dutch book consists of a set of bets that appear individually acceptable given the agent's credences but collectively result in a sure loss, as when an agent assigns a credence greater than 1 to an event or fails additivity for mutually exclusive outcomes.[77] These arguments, rooted in the idea that rational agents avoid sure losses, compel subjective probabilities to align with probabilistic coherence, though critics note that agents might rationally decline certain bets or that incoherence does not always lead to exploitation.[77]

In Bayesian confirmation theory, evidence confirms a hypothesis if it increases the agent's credence in that hypothesis upon updating, and disconfirms it if the credence decreases. Specifically, evidence E confirms hypothesis H when the posterior probability P(H | E) exceeds the prior probability P(H), with the strength of confirmation often measured by the Bayesian multiplier P(E | H) / P(E), where P(E | H) is the likelihood and P(E) is the marginal probability of the evidence.[78] This framework quantifies evidential support through ratios or differences of probabilities, allowing hypotheses to be incrementally strengthened or weakened by data; for example, observing a black raven mildly confirms the hypothesis that all ravens are black under uniform priors.[78] Updating beliefs via conditionalization preserves these confirmation relations, ensuring that new evidence coherently revises the probability distribution.[76]

Critiques of Bayesian epistemology often center on the tension between subjective and objective variants. Subjective Bayesianism permits any coherent prior assignment, emphasizing personal degrees of belief without further constraints, which allows for diverse but potentially biased inferences.[76] In contrast, objective Bayesianism imposes additional norms, such as the principle of indifference or maximum entropy priors, to derive unbiased probabilities from the available information, aiming for intersubjective agreement and scientific objectivity.[79] Detractors of subjective Bayesianism argue that it invites practical inconsistencies, such as marginalization paradoxes, and rests on unverifiable personal priors, while objective approaches face challenges like Bertrand's paradox in selecting uniform priors, potentially undermining uniqueness.[79]
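The Dutch book argument above can be made concrete with a toy calculation: the sketch below (all credence values hypothetical) shows that an agent whose credences in two mutually exclusive, exhaustive propositions sum to more than 1 accepts a pair of bets that loses money in every outcome:

```python
# Toy Dutch book: incoherent credences guarantee a loss.
cred_A, cred_not_A = 0.7, 0.6        # incoherent: they sum to 1.3
stake = 1.0                          # each bet pays `stake` if it wins

# The agent regards credence * stake as a fair price for each bet,
# so they willingly buy both.
cost = (cred_A + cred_not_A) * stake  # pays 1.30 up front

# Exactly one of A, not-A occurs, so exactly one bet pays off.
payout = stake                        # collects 1.00 in every outcome
print(f"Guaranteed loss: {cost - payout:.2f}")  # 0.30, whatever happens
```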
Historical Development
The foundations of Bayesian inference trace back to the posthumous publication in 1763 of Thomas Bayes's essay "An Essay towards solving a Problem in the Doctrine of Chances," which introduced a method for inverting conditional probabilities that forms the basis of what is now known as Bayes's theorem.[24] This work, communicated by Richard Price after Bayes's death, laid the groundwork for updating beliefs in light of new evidence, though it remained relatively obscure for decades.[80] Pierre-Simon Laplace independently developed a similar framework, formulating the rule of inverse probability and, in his 1812 Théorie Analytique des Probabilités, applying it to problems in astronomy and celestial mechanics, effectively popularizing the approach without reference to Bayes.[81] Laplace's contributions emphasized the theorem's utility in scientific inference, marking an early expansion of its scope beyond Bayes's original problem.

Amid the dominance of frequentist statistics in the early 20th century, Bayesian methods found a prominent defender in Harold Jeffreys, whose Theory of Probability (1939) advocated objective priors to resolve the charge of subjectivity and applied the approach to geophysical problems, reasserting its role in scientific hypothesis testing.[82] This work answered critics by proposing default priors that minimized the prior's influence on posteriors.[83] Complementing this, Leonard J. Savage's 1954 book The Foundations of Statistics provided a subjective interpretation, axiomatizing personal probability and decision theory within a Bayesian framework that unified utility and belief updating.[84] Savage's axioms demonstrated how coherent behavior implies Bayesian updating, influencing decision-theoretic applications.[85]

Bayesian inference experienced a major revival in the 1990s, driven by computational advances that addressed longstanding integration challenges in posterior estimation. The 1990 paper by Alan E. Gelfand and Adrian F. M. Smith introduced sampling-based methods using Markov chain Monte Carlo (MCMC) to approximate marginal densities, enabling practical Bayesian analysis for complex models.[86] This innovation, particularly Gibbs sampling variants, facilitated widespread adoption in statistics and beyond, marking the shift from theoretical foundations to computationally feasible inference.[87]
Thomas Bayes and Beyond
Thomas Bayes (1701–1761), an English mathematician and Presbyterian minister, developed the foundational concept of inverse probability, which allows updating beliefs about causes based on observed effects. His key contribution, detailed in an unpublished essay discovered after his death, addressed the probability of causes from known effects, providing the mathematical framework now recognized as Bayes' theorem. This work, edited and published posthumously by Richard Price in 1763 as "An Essay towards solving a Problem in the Doctrine of Chances," introduced a uniform prior distribution for unknown probabilities and proposed a method using imaginary outcomes to approximate posterior distributions, though it remained largely overlooked for decades.[88][89][90]

Pierre-Simon Laplace (1749–1827), a prominent French mathematician and astronomer, independently derived and generalized Bayes' theorem in the late 18th century, framing it as the principle of inverse probability to reason from effects to causes. In his 1774 memoir "Mémoire sur la probabilité des causes par les événements," Laplace applied this principle to estimate probabilities in astronomical observations, such as planetary perturbations, demonstrating its utility in scientific contexts. He further expanded its applications in his 1812 treatise Théorie Analytique des Probabilités, integrating inverse probability with error analysis and celestial mechanics, which helped establish probability theory as a tool for empirical inference across disciplines.[91][88]

Harold Jeffreys (1891–1989), a British mathematician, statistician, and geophysicist, played a pivotal role in reviving Bayesian methods during the early 20th century amid the rise of frequentist approaches. In his influential book Theory of Probability (1939), first published by Oxford University Press, Jeffreys articulated a systematic theory of scientific inference grounded in Bayesian principles, emphasizing the use of probability for hypothesis testing and parameter estimation. He proposed objective prior distributions, such as the Jeffreys prior invariant under reparameterization, and applied Bayesian techniques to geophysical problems such as seismology, arguing that inverse probability provided a more coherent basis for inductive reasoning than likelihood-based methods. The book, revised in multiple editions through 1961, became a cornerstone for Bayesian advocates in scientific fields.[38]

Following World War II, Bayesian inference gained renewed momentum through key figures who advanced its theoretical foundations and computational feasibility. Dennis Lindley (1923–2013), a British statistician, became a leading proponent of Bayesian decision theory, integrating utility maximization with probability updating to guide rational choice under uncertainty; he co-authored seminal works on Bayesian experimental design and championed the Valencia International Meetings on Bayesian Statistics, first held in 1979, which fostered global collaboration. Bruno de Finetti (1906–1985), an Italian probabilist, formalized subjective probability as degrees of belief kept coherent under Dutch book arguments, rejecting objective frequencies in favor of personal probabilities updated via Bayes' rule, as detailed in his multi-volume Teoria delle Probabilità (1974–1975).
In the computational era, Radford Neal advanced practical Bayesian analysis through Markov chain Monte Carlo (MCMC) methods, notably in his 1993 technical report "Probabilistic Inference Using Markov Chain Monte Carlo Methods," which showed how to sample efficiently from complex posterior distributions, enabling applications in machine learning and neural networks. These contributions helped transform Bayesian thought from philosophical abstraction into a computationally viable paradigm for modern data analysis.[92][93][94]

References
- https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter9.pdf
