Bayesian inference
from Wikipedia

Bayesian inference (/ˈbeɪziən/ BAY-zee-ən or /ˈbeɪʒən/ BAY-zhən)[1] is a method of statistical inference in which Bayes' theorem is used to calculate a probability of a hypothesis, given prior evidence, and update it as more information becomes available. Fundamentally, Bayesian inference uses a prior distribution to estimate posterior probabilities. Bayesian inference is an important technique in statistics, and especially in mathematical statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law. In the philosophy of decision theory, Bayesian inference is closely related to subjective probability, often called "Bayesian probability".

Introduction to Bayes' rule

A geometric visualisation of Bayes' theorem. In the table, the values 2, 3, 6 and 9 give the relative weights of each corresponding condition and case. The figures denote the cells of the table involved in each metric, the probability being the fraction of each figure that is shaded. This shows that P(A|B) P(B) = P(B|A) P(A), i.e. P(A|B) = P(B|A) P(A) / P(B). Similar reasoning can be used to show that P(Ā|B) = P(B|Ā) P(Ā) / P(B), etc.

Formal explanation

Contingency table

    Evidence \ Hypothesis    Satisfies hypothesis H           Violates hypothesis ¬H              Total
    Has evidence E           P(H|E)·P(E) = P(E|H)·P(H)        P(¬H|E)·P(E) = P(E|¬H)·P(¬H)        P(E)
    No evidence ¬E           P(H|¬E)·P(¬E) = P(¬E|H)·P(H)     P(¬H|¬E)·P(¬E) = P(¬E|¬H)·P(¬H)     P(¬E) = 1 − P(E)
    Total                    P(H)                             P(¬H) = 1 − P(H)                    1

Bayesian inference derives the posterior probability as a consequence of two antecedents: a prior probability and a "likelihood function" derived from a statistical model for the observed data. Bayesian inference computes the posterior probability according to Bayes' theorem:

$P(H \mid E) = \frac{P(E \mid H) \cdot P(H)}{P(E)}$

where

  • H stands for any hypothesis whose probability may be affected by data (called evidence below). Often there are competing hypotheses, and the task is to determine which is the most probable.
  • P(H), the prior probability, is the estimate of the probability of the hypothesis H before the data E, the current evidence, is observed.
  • E, the evidence, corresponds to new data that were not used in computing the prior probability.
  • P(H | E), the posterior probability, is the probability of H given E, i.e., after E is observed. This is what we want to know: the probability of a hypothesis given the observed evidence.
  • P(E | H) is the probability of observing E given H and is called the likelihood. As a function of E with H fixed, it indicates the compatibility of the evidence with the given hypothesis. The likelihood function is a function of the evidence, E, while the posterior probability is a function of the hypothesis, H.
  • P(E) is sometimes termed the marginal likelihood or "model evidence". This factor is the same for all possible hypotheses being considered (as is evident from the fact that the hypothesis H does not appear anywhere in the symbol, unlike for all the other factors) and hence does not factor into determining the relative probabilities of different hypotheses.
  • P(E) > 0 is assumed. (Else one has division by zero, and the posterior is undefined.)

For different values of H, only the factors P(H) and P(E | H), both in the numerator, affect the value of P(H | E) – the posterior probability of a hypothesis is proportional to its prior probability (its inherent likeliness) and the newly acquired likelihood (its compatibility with the new observed evidence).

In cases where ¬H ("not H"), the logical negation of H, is a valid likelihood, Bayes' rule can be rewritten as follows:

$P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)} = \frac{P(E \mid H)\, P(H)}{P(E \mid H)\, P(H) + P(E \mid \neg H)\, P(\neg H)} = \frac{1}{1 + \left(\frac{1}{P(H)} - 1\right) \frac{P(E \mid \neg H)}{P(E \mid H)}}$

because

$P(E) = P(E \mid H)\, P(H) + P(E \mid \neg H)\, P(\neg H)$

and

$P(H) + P(\neg H) = 1.$

This focuses attention on the term

$\left(\frac{1}{P(H)} - 1\right) \frac{P(E \mid \neg H)}{P(E \mid H)}.$

If that term is approximately 1, then the probability of the hypothesis given the evidence, P(H | E), is about 1/2: the hypothesis is as likely as not. If that term is very small, close to zero, then the probability of the hypothesis given the evidence is close to 1, i.e. the hypothesis is quite likely. If that term is very large, much larger than 1, then the hypothesis, given the evidence, is quite unlikely. If the hypothesis (without consideration of evidence) is unlikely, then P(H) is small (but not necessarily astronomically small), 1/P(H) − 1 is much larger than 1, and the term can be approximated as P(E | ¬H) / (P(E | H) · P(H)), so the relevant probabilities can be compared directly to each other.

One quick and easy way to remember the equation would be to use the rule of multiplication:

$P(E \cap H) = P(E \mid H)\, P(H) = P(H \mid E)\, P(E).$
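These identities are straightforward to check numerically. The following sketch, with illustrative numbers not taken from the article, verifies that the direct form of Bayes' rule and the odds-form rewrite above agree:

    # Illustrative numbers: check the direct form of Bayes' rule
    # against the odds-form rewrite derived above.
    p_h = 0.3        # prior P(H)
    p_e_h = 0.9      # likelihood P(E | H)
    p_e_not_h = 0.2  # likelihood P(E | ¬H)

    # Marginal evidence via the law of total probability.
    p_e = p_e_h * p_h + p_e_not_h * (1 - p_h)

    # Direct form of Bayes' rule.
    posterior_direct = p_e_h * p_h / p_e

    # Odds form: P(H | E) = 1 / (1 + (1/P(H) - 1) * P(E | ¬H) / P(E | H)).
    term = (1 / p_h - 1) * p_e_not_h / p_e_h
    posterior_odds_form = 1 / (1 + term)

    assert abs(posterior_direct - posterior_odds_form) < 1e-12
    print(posterior_direct)  # ≈ 0.6585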

Alternatives to Bayesian updating


Bayesian updating is widely used and computationally convenient. However, it is not the only updating rule that might be considered rational.

Ian Hacking noted that traditional "Dutch book" arguments did not specify Bayesian updating: they left open the possibility that non-Bayesian updating rules could avoid Dutch books. Hacking wrote:[2] "And neither the Dutch book argument nor any other in the personalist arsenal of proofs of the probability axioms entails the dynamic assumption. Not one entails Bayesianism. So the personalist requires the dynamic assumption to be Bayesian. It is true that in consistency a personalist could abandon the Bayesian model of learning from experience. Salt could lose its savour."

Indeed, there are non-Bayesian updating rules that also avoid Dutch books (as discussed in the literature on "probability kinematics") following the publication of Richard C. Jeffrey's rule, which applies Bayes' rule to the case where the evidence itself is assigned a probability.[3] The additional hypotheses needed to uniquely require Bayesian updating have been deemed to be substantial, complicated, and unsatisfactory.[4]

Inference over exclusive and exhaustive possibilities


If evidence is simultaneously used to update belief over a set of exclusive and exhaustive propositions, Bayesian inference may be thought of as acting on this belief distribution as a whole.

General formulation

Diagram illustrating event space in general formulation of Bayesian inference. Although this diagram shows discrete models and events, the continuous case may be visualized similarly using probability densities.

Suppose a process is generating independent and identically distributed events E_n, n = 1, 2, 3, …, but the probability distribution is unknown. Let the event space Ω represent the current state of belief for this process. Each model is represented by event M_m. The conditional probabilities P(E_n | M_m) are specified to define the models. P(M_m) is the degree of belief in M_m. Before the first inference step, {P(M_m)} is a set of initial prior probabilities. These must sum to 1, but are otherwise arbitrary.

Suppose that the process is observed to generate E ∈ {E_n}. For each M ∈ {M_m}, the prior P(M) is updated to the posterior P(M | E). From Bayes' theorem:[5]

$P(M \mid E) = \frac{P(E \mid M)}{\sum_m P(E \mid M_m)\, P(M_m)} \cdot P(M)$

Upon observation of further evidence, this procedure may be repeated.

Multiple observations


For a sequence of independent and identically distributed observations E = (e_1, …, e_n), it can be shown by induction that repeated application of the above is equivalent to

$P(M \mid \mathbf{E}) = \frac{P(\mathbf{E} \mid M)}{\sum_m P(\mathbf{E} \mid M_m)\, P(M_m)} \cdot P(M),$

where

$P(\mathbf{E} \mid M) = \prod_k P(e_k \mid M).$
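This equivalence of repeated single-observation updates and one batch update is easy to confirm in code. A minimal sketch over two hypothetical models of a binary process (the models and data are illustrative):

    # Two hypothetical models: P(success | M1) = 0.5, P(success | M2) = 0.8,
    # with equal priors. Sequential updating matches a single batch update.
    likelihood = {"M1": 0.5, "M2": 0.8}
    prior = {"M1": 0.5, "M2": 0.5}
    observations = [1, 1, 0, 1, 1]  # 1 = success, 0 = failure

    def step(belief, x):
        """One application of Bayes' theorem over the discrete model space."""
        unnorm = {m: (likelihood[m] if x else 1 - likelihood[m]) * p
                  for m, p in belief.items()}
        z = sum(unnorm.values())
        return {m: v / z for m, v in unnorm.items()}

    # Sequential: the posterior after each observation becomes the next prior.
    sequential = prior
    for x in observations:
        sequential = step(sequential, x)

    # Batch: multiply all likelihoods first, then normalize once.
    def joint_likelihood(m):
        prod = 1.0
        for x in observations:
            prod *= likelihood[m] if x else 1 - likelihood[m]
        return prod

    unnorm = {m: joint_likelihood(m) * p for m, p in prior.items()}
    z = sum(unnorm.values())
    batch = {m: v / z for m, v in unnorm.items()}

    assert all(abs(sequential[m] - batch[m]) < 1e-12 for m in prior)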

Parametric formulation: motivating the formal description


By parameterizing the space of models, the belief in all models may be updated in a single step. The distribution of belief over the model space may then be thought of as a distribution of belief over the parameter space. The distributions in this section are expressed as continuous, represented by probability densities, as this is the usual situation. The technique is, however, equally applicable to discrete distributions.

Let the vector θ span the parameter space. Let the initial prior distribution over θ be p(θ | α), where α is a set of parameters to the prior itself, or hyperparameters. Let E = (e_1, …, e_n) be a sequence of independent and identically distributed event observations, where all e_i are distributed as p(e | θ) for some θ. Bayes' theorem is applied to find the posterior distribution over θ:

$p(\boldsymbol\theta \mid \mathbf{E}, \boldsymbol\alpha) = \frac{p(\mathbf{E} \mid \boldsymbol\theta, \boldsymbol\alpha)\, p(\boldsymbol\theta \mid \boldsymbol\alpha)}{p(\mathbf{E} \mid \boldsymbol\alpha)} = \frac{p(\mathbf{E} \mid \boldsymbol\theta, \boldsymbol\alpha)\, p(\boldsymbol\theta \mid \boldsymbol\alpha)}{\int p(\mathbf{E} \mid \boldsymbol\theta, \boldsymbol\alpha)\, p(\boldsymbol\theta \mid \boldsymbol\alpha)\, d\boldsymbol\theta}$

where

$p(\mathbf{E} \mid \boldsymbol\theta, \boldsymbol\alpha) = \prod_k p(e_k \mid \boldsymbol\theta).$

Formal description of Bayesian inference


Definitions

  • x, a data point in general. This may in fact be a vector of values.
  • θ, the parameter of the data point's distribution, i.e., x ~ p(x | θ). This may be a vector of parameters.
  • α, the hyperparameter of the parameter distribution, i.e., θ ~ p(θ | α). This may be a vector of hyperparameters.
  • X is the sample, a set of n observed data points, i.e., x_1, …, x_n.
  • x̃, a new data point whose distribution is to be predicted.

Bayesian inference

  • The prior distribution is the distribution of the parameter(s) before any data is observed, i.e. p(θ | α). The prior distribution might not be easily determined; in such a case, one possibility may be to use the Jeffreys prior to obtain a prior distribution before updating it with newer observations.
  • The sampling distribution is the distribution of the observed data conditional on its parameters, i.e. p(X | θ). This is also termed the likelihood, especially when viewed as a function of the parameter(s), sometimes written L(θ | X) = p(X | θ).
  • The marginal likelihood (sometimes also termed the evidence) is the distribution of the observed data marginalized over the parameter(s), i.e. $p(\mathbf{X} \mid \alpha) = \int p(\mathbf{X} \mid \theta)\, p(\theta \mid \alpha)\, d\theta.$ It quantifies the agreement between data and expert opinion, in a geometric sense that can be made precise.[6] If the marginal likelihood is 0 then there is no agreement between the data and expert opinion and Bayes' rule cannot be applied.
  • The posterior distribution is the distribution of the parameter(s) after taking into account the observed data. This is determined by Bayes' rule, which forms the heart of Bayesian inference: $p(\theta \mid \mathbf{X}, \alpha) = \frac{p(\mathbf{X} \mid \theta)\, p(\theta \mid \alpha)}{p(\mathbf{X} \mid \alpha)} \propto p(\mathbf{X} \mid \theta)\, p(\theta \mid \alpha).$ This is expressed in words as "posterior is proportional to likelihood times prior", or sometimes as "posterior = likelihood times prior, over evidence".
  • In practice, for almost all complex Bayesian models used in machine learning, the posterior distribution is not obtained in closed form, mainly because the parameter space for θ can be very high-dimensional, or because the Bayesian model retains a certain hierarchical structure formulated from the observations X and the parameter θ. In such situations, we need to resort to approximation techniques.[7]
  • General case: Let f(x | θ) be the conditional density of the data x given the parameter θ, and let π(θ) be the prior density of θ. The joint density is then f(x | θ) π(θ). The conditional (posterior) density of θ given the observation x = X is then determined by $\pi(\theta \mid X) = \frac{f(X \mid \theta)\, \pi(\theta)}{\int f(X \mid \theta)\, \pi(\theta)\, d\theta}.$

Existence and uniqueness of the needed conditional expectation is a consequence of the Radon–Nikodym theorem. This was formulated by Kolmogorov in his famous 1933 book Foundations of the Theory of Probability. Kolmogorov underlines the importance of conditional probability by writing "I wish to call attention to ... and especially the theory of conditional probabilities and conditional expectations ..." in the Preface.[8] Bayes' theorem determines the posterior distribution from the prior distribution. Uniqueness requires continuity assumptions.[9] Bayes' theorem can be generalized to include improper prior distributions such as the uniform distribution on the real line.[10] Modern Markov chain Monte Carlo methods have boosted the importance of Bayes' theorem, including in cases with improper priors.[11]

Bayesian prediction

  • The posterior predictive distribution is the distribution of a new data point, marginalized over the posterior: $p(\tilde{x} \mid \mathbf{X}, \alpha) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid \mathbf{X}, \alpha)\, d\theta$
  • The prior predictive distribution is the distribution of a new data point, marginalized over the prior: $p(\tilde{x} \mid \alpha) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid \alpha)\, d\theta$

Bayesian theory calls for the use of the posterior predictive distribution to do predictive inference, i.e., to predict the distribution of a new, unobserved data point. That is, instead of a fixed point as a prediction, a distribution over possible points is returned. Only this way is the entire posterior distribution of the parameter(s) used. By comparison, prediction in frequentist statistics often involves finding an optimum point estimate of the parameter(s)—e.g., by maximum likelihood or maximum a posteriori estimation (MAP)—and then plugging this estimate into the formula for the distribution of a data point. This has the disadvantage that it does not account for any uncertainty in the value of the parameter, and hence will underestimate the variance of the predictive distribution.

In some instances, frequentist statistics can work around this problem. For example, confidence intervals and prediction intervals in frequentist statistics when constructed from a normal distribution with unknown mean and variance are constructed using a Student's t-distribution. This correctly estimates the variance, due to the facts that (1) the average of normally distributed random variables is also normally distributed, and (2) the predictive distribution of a normally distributed data point with unknown mean and variance, using conjugate or uninformative priors, has a Student's t-distribution. In Bayesian statistics, however, the posterior predictive distribution can always be determined exactly—or at least to an arbitrary level of precision when numerical methods are used.

Both types of predictive distributions have the form of a compound probability distribution (as does the marginal likelihood). In fact, if the prior distribution is a conjugate prior, such that the prior and posterior distributions come from the same family, it can be seen that both prior and posterior predictive distributions also come from the same family of compound distributions. The only difference is that the posterior predictive distribution uses the updated values of the hyperparameters (applying the Bayesian update rules given in the conjugate prior article), while the prior predictive distribution uses the values of the hyperparameters that appear in the prior distribution.
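As a concrete sketch of the compound-distribution point, assume a Beta prior on a Bernoulli success probability (an assumption for illustration, not an example from this article): the posterior predictive for k successes in n future trials is beta-binomial, which has the same mean as the frequentist plug-in binomial but greater spread, reflecting the remaining parameter uncertainty.

    from math import comb, exp, lgamma

    def log_beta(a, b):
        return lgamma(a) + lgamma(b) - lgamma(a + b)

    def beta_binomial_pmf(k, n, a, b):
        """Compound (posterior predictive) pmf under a Beta(a, b) posterior."""
        return comb(n, k) * exp(log_beta(k + a, n - k + b) - log_beta(a, b))

    # Beta(1, 1) prior, 7 successes in 10 trials -> posterior Beta(8, 4).
    a_post, b_post = 1 + 7, 1 + 3
    n_future = 5

    predictive = [beta_binomial_pmf(k, n_future, a_post, b_post)
                  for k in range(n_future + 1)]

    # Plug-in comparison: Binomial(n, theta_hat) at the posterior mean 8/12.
    theta_hat = a_post / (a_post + b_post)
    plug_in = [comb(n_future, k) * theta_hat**k * (1 - theta_hat)**(n_future - k)
               for k in range(n_future + 1)]
    # Both pmfs sum to 1 and share the same mean; the posterior predictive
    # is more dispersed because it averages over the uncertainty in theta.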

Mathematical properties


Interpretation of the factor P(E | M) / P(E)


If P(E | M) / P(E) > 1, i.e. P(E | M) > P(E), then observing E increases the belief in model M. That is, if the model were true, the evidence would be more likely than is predicted by the current state of belief. The reverse applies for a decrease in belief. If the belief does not change, P(E | M) / P(E) = 1, i.e. P(E | M) = P(E). That is, the evidence is independent of the model. If the model were true, the evidence would be exactly as likely as predicted by the current state of belief.

Cromwell's rule


If P(M) = 0, then P(M | E) = 0. If P(M) = 1 and P(E) > 0, then P(M | E) = 1. This can be interpreted to mean that hard convictions are insensitive to counter-evidence.

The former follows directly from Bayes' theorem. The latter can be derived by applying the first rule to the event "not M" in place of "M", yielding "if P(¬M) = 0, then P(¬M | E) = 0", from which the result immediately follows.

Asymptotic behaviour of posterior


Consider the behaviour of a belief distribution as it is updated a large number of times with independent and identically distributed trials. For sufficiently nice prior probabilities, the Bernstein–von Mises theorem gives that in the limit of infinite trials, the posterior converges to a Gaussian distribution independent of the initial prior under some conditions first outlined and rigorously proven by Joseph L. Doob in 1948, namely if the random variable in consideration has a finite probability space. The more general results were obtained later by the statistician David A. Freedman, who published two seminal research papers in 1963[12] and 1965[13] establishing when and under what circumstances the asymptotic behaviour of the posterior is guaranteed. His 1963 paper treats, like Doob (1949), the finite case and comes to a satisfactory conclusion. However, if the random variable has an infinite but countable probability space (i.e., corresponding to a die with infinitely many faces), the 1965 paper demonstrates that for a dense subset of priors the Bernstein–von Mises theorem is not applicable. In this case there is almost surely no asymptotic convergence. Later, in the 1980s and 1990s, Freedman and Persi Diaconis continued to work on the case of infinite countable probability spaces.[14] To summarise, there may be insufficient trials to suppress the effects of the initial choice, and especially for large (but finite) systems the convergence might be very slow.

Conjugate priors


In parameterized form, the prior distribution is often assumed to come from a family of distributions called conjugate priors. The usefulness of a conjugate prior is that the corresponding posterior distribution will be in the same family, and the calculation may be expressed in closed form.
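A standard illustration of such a closed-form update (a sketch under assumed distributions, not an example from this article) is the normal–normal conjugate pair for a mean with known observation variance:

    def normal_normal_update(mu0, tau0_sq, xs, sigma_sq):
        """Conjugate update for a Normal mean with known variance.

        Prior:      theta ~ N(mu0, tau0_sq)
        Likelihood: x_i   ~ N(theta, sigma_sq), i.i.d.
        Posterior:  theta | xs ~ N(mu_n, tau_n_sq), in closed form.
        """
        n = len(xs)
        xbar = sum(xs) / n
        precision = 1 / tau0_sq + n / sigma_sq      # posterior precision: precisions add
        mu_n = (mu0 / tau0_sq + n * xbar / sigma_sq) / precision
        return mu_n, 1 / precision

    mu_n, tau_n_sq = normal_normal_update(
        mu0=0.0, tau0_sq=4.0, xs=[1.2, 0.7, 1.9, 1.1], sigma_sq=1.0)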

Estimates of parameters and predictions


It is often desired to use a posterior distribution to estimate a parameter or variable. Several methods of Bayesian estimation select measurements of central tendency from the posterior distribution.

For one-dimensional problems with a continuous posterior, a unique median exists in all practical cases. The posterior median is attractive as a robust estimator.[15]

If there exists a finite mean for the posterior distribution, then the posterior mean is a method of estimation.[16]

Taking a value with the greatest probability defines maximum a posteriori (MAP) estimates:[17]

$\{\theta_{\text{MAP}}\} \subset \arg\max_{\theta} p(\theta \mid \mathbf{X}, \alpha).$

There are examples where no maximum is attained, in which case the set of MAP estimates is empty.

There are other methods of estimation that minimize the posterior risk (expected-posterior loss) with respect to a loss function, and these are of interest to statistical decision theory using the sampling distribution ("frequentist statistics").[18]

The posterior predictive distribution of a new observation x̃ (that is independent of previous observations) is determined by[19]

$p(\tilde{x} \mid \mathbf{X}, \alpha) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid \mathbf{X}, \alpha)\, d\theta.$
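Given draws from a posterior (here a Beta(8, 4) stand-in, purely illustrative), the common Bayes estimators can be read off directly; the mean minimizes expected squared-error loss and the median expected absolute-error loss:

    import numpy as np

    rng = np.random.default_rng(0)
    draws = rng.beta(8, 4, size=100_000)   # stand-in posterior samples

    posterior_mean = draws.mean()          # optimal under squared-error loss
    posterior_median = np.median(draws)    # optimal under absolute-error loss

    # Crude MAP estimate: locate the mode with a histogram of the draws.
    counts, edges = np.histogram(draws, bins=200)
    i = counts.argmax()
    map_estimate = 0.5 * (edges[i] + edges[i + 1])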

Examples


Probability of a hypothesis

Contingency table

    Cookie \ Bowl    #1 (H1)    #2 (H2)    Total
    Plain, E         30         20         50
    Choc, ¬E         10         20         30
    Total            40         40         80

P(H1 | E) = 30 / 50 = 0.6

Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1?

Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. Let H1 correspond to bowl #1, and H2 to bowl #2. It is given that the bowls are identical from Fred's point of view, thus P(H1) = P(H2), and the two must add up to 1, so both are equal to 0.5. The event E is the observation of a plain cookie. From the contents of the bowls, we know that P(E | H1) = 30/40 = 0.75 and P(E | H2) = 20/40 = 0.5. Bayes' formula then yields

$P(H_1 \mid E) = \frac{P(E \mid H_1)\, P(H_1)}{P(E \mid H_1)\, P(H_1) + P(E \mid H_2)\, P(H_2)} = \frac{0.75 \times 0.5}{0.75 \times 0.5 + 0.5 \times 0.5} = 0.6.$

Before we observed the cookie, the probability we assigned for Fred having chosen bowl #1 was the prior probability, P(H1), which was 0.5. After observing the cookie, we must revise the probability to P(H1 | E), which is 0.6.
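The computation is short enough to check directly:

    # Cookie example: P(bowl #1 | plain cookie).
    p_h1 = p_h2 = 0.5    # Fred picks either bowl with equal probability
    p_e_h1 = 30 / 40     # fraction of plain cookies in bowl #1
    p_e_h2 = 20 / 40     # fraction of plain cookies in bowl #2

    p_e = p_e_h1 * p_h1 + p_e_h2 * p_h2   # P(plain) = 0.625
    print(p_e_h1 * p_h1 / p_e)            # P(H1 | E) = 0.6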

Making a prediction

Example results for archaeology example. This simulation was generated using c=15.2.

An archaeologist is working at a site thought to be from the medieval period, between the 11th century and the 16th century. However, it is uncertain exactly when in this period the site was inhabited. Fragments of pottery are found, some of which are glazed and some of which are decorated. It is expected that if the site were inhabited during the early medieval period, then 1% of the pottery would be glazed and 50% of its area decorated, whereas if it had been inhabited in the late medieval period then 81% would be glazed and 5% of its area decorated. How confident can the archaeologist be in the date of inhabitation as fragments are unearthed?

The degree of belief in the continuous variable C (century) is to be calculated, with the discrete set of events {GD, GD̄, ḠD, ḠD̄} (glazed/unglazed, decorated/undecorated) as evidence. Assuming linear variation of glaze and decoration with time, and that these variables are independent,

$P(E_G \mid C = c) = 0.01 + \frac{0.81 - 0.01}{16 - 11}(c - 11)$
$P(E_{\bar G} \mid C = c) = 1 - P(E_G \mid C = c)$
$P(E_D \mid C = c) = 0.5 - \frac{0.5 - 0.05}{16 - 11}(c - 11)$
$P(E_{\bar D} \mid C = c) = 1 - P(E_D \mid C = c)$

Assume a uniform prior of f_C(c) = 0.2, and that trials are independent and identically distributed. When a new fragment of type e is discovered, Bayes' theorem is applied to update the degree of belief for each c:

$f_C(c \mid E = e) = \frac{P(E = e \mid C = c)\, f_C(c)}{\int_{11}^{16} P(E = e \mid C = c)\, f_C(c)\, dc}$

A computer simulation of the changing belief as 50 fragments are unearthed is shown on the graph. In the simulation, the site was inhabited around 1420, or . By calculating the area under the relevant portion of the graph for 50 trials, the archaeologist can say that there is practically no chance the site was inhabited in the 11th and 12th centuries, about 1% chance that it was inhabited during the 13th century, 63% chance during the 14th century and 36% during the 15th century. The Bernstein-von Mises theorem asserts here the asymptotic convergence to the "true" distribution because the probability space corresponding to the discrete set of events is finite (see above section on asymptotic behaviour of the posterior).
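A sketch of this kind of simulation, assuming the linear glaze/decoration model above (the random seed and grid resolution are arbitrary choices, not from the article):

    import numpy as np

    rng = np.random.default_rng(1)
    c = np.linspace(11, 16, 501)           # candidate dates (century)
    dc = c[1] - c[0]
    belief = np.full_like(c, 0.2)          # uniform prior density f_C(c)

    def p_glazed(cc):
        return 0.01 + (0.81 - 0.01) * (cc - 11) / (16 - 11)

    def p_decorated(cc):
        return 0.50 - (0.50 - 0.05) * (cc - 11) / (16 - 11)

    true_c = 15.2
    for _ in range(50):                    # unearth 50 fragments
        glazed = rng.random() < p_glazed(true_c)
        decorated = rng.random() < p_decorated(true_c)
        # Likelihood of this fragment type for every candidate date
        # (glaze and decoration are assumed independent).
        like = (p_glazed(c) if glazed else 1 - p_glazed(c)) \
             * (p_decorated(c) if decorated else 1 - p_decorated(c))
        belief *= like                     # Bayes update for every candidate c
        belief /= belief.sum() * dc        # renormalize to a density

    # Posterior probability the site dates to the 14th century:
    # the area under the density on [14, 15).
    mask = (c >= 14) & (c < 15)
    print(belief[mask].sum() * dc)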

In frequentist statistics and decision theory


A decision-theoretic justification of the use of Bayesian inference was given by Abraham Wald, who proved that every unique Bayesian procedure is admissible. Conversely, every admissible statistical procedure is either a Bayesian procedure or a limit of Bayesian procedures.[20]

Wald characterized admissible procedures as Bayesian procedures (and limits of Bayesian procedures), making the Bayesian formalism a central technique in such areas of frequentist inference as parameter estimation, hypothesis testing, and computing confidence intervals.[21][22][23] For example:

  • "Under some conditions, all admissible procedures are either Bayes procedures or limits of Bayes procedures (in various senses). These remarkable results, at least in their original form, are due essentially to Wald. They are useful because the property of being Bayes is easier to analyze than admissibility."[20]
  • "In decision theory, a quite general method for proving admissibility consists in exhibiting a procedure as a unique Bayes solution."[24]
  • "In the first chapters of this work, prior distributions with finite support and the corresponding Bayes procedures were used to establish some of the main theorems relating to the comparison of experiments. Bayes procedures with respect to more general prior distributions have played a very important role in the development of statistics, including its asymptotic theory." "There are many problems where a glance at posterior distributions, for suitable priors, yields immediately interesting information. Also, this technique can hardly be avoided in sequential analysis."[25]
  • "A useful fact is that any Bayes decision rule obtained by taking a proper prior over the whole parameter space must be admissible"[26]
  • "An important area of investigation in the development of admissibility ideas has been that of conventional sampling-theory procedures, and many interesting results have been obtained."[27]

Model selection


Bayesian methodology also plays a role in model selection, where the aim is to select one model from a set of competing models that represents most closely the underlying process that generated the observed data. In Bayesian model comparison, the model with the highest posterior probability given the data is selected. The posterior probability of a model depends on the evidence, or marginal likelihood, which reflects the probability that the data is generated by the model, and on the prior belief in the model. When two competing models are a priori considered to be equiprobable, the ratio of their posterior probabilities corresponds to the Bayes factor. Since Bayesian model comparison is aimed at selecting the model with the highest posterior probability, this methodology is also referred to as the maximum a posteriori (MAP) selection rule[28] or the MAP probability rule.[29]

Probabilistic programming


While conceptually simple, Bayesian methods can be mathematically and numerically challenging. Probabilistic programming languages (PPLs) implement functions to easily build Bayesian models together with efficient automatic inference methods. This helps separate the model building from the inference, allowing practitioners to focus on their specific problems and leaving PPLs to handle the computational details for them.[30][31][32]
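As a sketch of the style, a beta–Bernoulli model in the PyMC probabilistic programming language (one PPL among many; syntax follows recent PyMC releases, and the data are illustrative) separates the model statement from the inference call:

    import pymc as pm

    data = [1, 0, 1, 1, 0, 1, 1, 1]                  # illustrative binary outcomes

    with pm.Model():
        theta = pm.Beta("theta", alpha=1, beta=1)    # prior
        pm.Bernoulli("obs", p=theta, observed=data)  # likelihood
        idata = pm.sample(1000)                      # the PPL handles the sampler

The practitioner writes only the prior and likelihood; the automatic inference method (here MCMC) is chosen and tuned by the library.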

Applications


Statistical data analysis


See the separate Wikipedia entry on Bayesian statistics, specifically the statistical modeling section in that page.

Computer applications


Bayesian inference has applications in artificial intelligence and expert systems. Bayesian inference techniques have been a fundamental part of computerized pattern recognition techniques since the late 1950s.[33] There is also an ever-growing connection between Bayesian methods and simulation-based Monte Carlo techniques, since complex models cannot be processed in closed form by a Bayesian analysis, while a graphical model structure may allow for efficient simulation algorithms like Gibbs sampling and other Metropolis–Hastings algorithm schemes.[34] More recently, Bayesian inference has gained popularity among the phylogenetics community for these reasons; a number of applications allow many demographic and evolutionary parameters to be estimated simultaneously.

As applied to statistical classification, Bayesian inference has been used to develop algorithms for identifying e-mail spam. Applications which make use of Bayesian inference for spam filtering include CRM114, DSPAM, Bogofilter, SpamAssassin, SpamBayes, Mozilla, XEAMS, and others. Spam classification is treated in more detail in the article on the naïve Bayes classifier.

Solomonoff's inductive inference is the theory of prediction based on observations; for example, predicting the next symbol based upon a given series of symbols. The only assumption is that the environment follows some unknown but computable probability distribution. It is a formal inductive framework that combines two well-studied principles of inductive inference: Bayesian statistics and Occam's razor.[35] Solomonoff's universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p. Given some p and any computable but unknown probability distribution from which x is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal fashion.[36][37]

Bioinformatics and healthcare applications


Bayesian inference has been applied in different bioinformatics applications, including differential gene expression analysis.[38] Bayesian inference is also used in a general cancer risk model, called CIRI (Continuous Individualized Risk Index), where serial measurements are incorporated to update a Bayesian model which is primarily built from prior knowledge.[39][40]

Cosmology and astrophysical applications


The Bayesian approach has been central to recent progress in cosmology and astrophysics,[41][42] and extends to a wide range of astrophysical problems, including the characterisation of exoplanets (such as the fitting of the atmosphere of K2-18b[43]), parameter constraints with cosmological data,[44] and calibration in astrophysical experiments.[45]

In cosmology, it is often employed with computational techniques such as Markov chain Monte Carlo (MCMC) and nested sampling algorithms to analyse complex datasets and navigate high-dimensional parameter spaces. A notable application is the analysis of the Planck 2018 CMB data for parameter inference.[44] The six base cosmological parameters of the Lambda-CDM model are not predicted by theory, but rather fitted from cosmic microwave background (CMB) data to a chosen model of cosmology (the Lambda-CDM model).[46] The Bayesian cosmology code `cobaya`[47] sets up cosmological runs and interfaces cosmological likelihoods and Boltzmann codes,[48][49] which compute the predicted CMB anisotropies for any given set of cosmological parameters, with an MCMC or nested sampler.

This computational framework is not limited to the standard model; it is also essential for testing alternative or extended theories of cosmology, such as theories with early dark energy[50] or modified gravity theories introducing additional parameters beyond Lambda-CDM. Bayesian model comparison can then be employed to calculate the evidence for competing models, providing a statistical basis for assessing whether the data support them over the standard Lambda-CDM.[51]

In the courtroom


Bayesian inference can be used by jurors to coherently accumulate the evidence for and against a defendant, and to see whether, in totality, it meets their personal threshold for "beyond a reasonable doubt".[52][53][54] Bayes' theorem is applied successively to all evidence presented, with the posterior from one stage becoming the prior for the next. The benefit of a Bayesian approach is that it gives the juror an unbiased, rational mechanism for combining evidence. It may be appropriate to explain Bayes' theorem to jurors in odds form, as betting odds are more widely understood than probabilities. Alternatively, a logarithmic approach, replacing multiplication with addition, might be easier for a jury to handle.

Adding up evidence

If the existence of the crime is not in doubt, only the identity of the culprit, it has been suggested that the prior should be uniform over the qualifying population.[55] For example, if 1,000 people could have committed the crime, the prior probability of guilt would be 1/1000.
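In odds form, this accumulation is a running product of likelihood ratios; a minimal sketch with purely hypothetical numbers:

    # Hypothetical case: 1,000 qualifying suspects, so prior odds of guilt 1 : 999.
    posterior_odds = 1 / 999

    # Each item of evidence contributes a likelihood ratio
    # P(evidence | guilty) / P(evidence | innocent); values are illustrative.
    for likelihood_ratio in [300.0, 50.0, 4.0]:
        posterior_odds *= likelihood_ratio           # Bayes' rule in odds form

    posterior_prob = posterior_odds / (1 + posterior_odds)
    print(posterior_prob)                            # ≈ 0.984 for these inputs

Replacing the multiplications with sums of log likelihood ratios gives the additive, logarithmic variant mentioned above.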

The use of Bayes' theorem by jurors is controversial. In the United Kingdom, a defence expert witness explained Bayes' theorem to the jury in R v Adams. The jury convicted, but the case went to appeal on the basis that no means of accumulating evidence had been provided for jurors who did not wish to use Bayes' theorem. The Court of Appeal upheld the conviction, but it also gave the opinion that "To introduce Bayes' Theorem, or any similar method, into a criminal trial plunges the jury into inappropriate and unnecessary realms of theory and complexity, deflecting them from their proper task."

Gardner-Medwin[56] argues that the criterion on which a verdict in a criminal trial should be based is not the probability of guilt, but rather the probability of the evidence, given that the defendant is innocent (akin to a frequentist p-value). He argues that if the posterior probability of guilt is to be computed by Bayes' theorem, the prior probability of guilt must be known. This will depend on the incidence of the crime, which is an unusual piece of evidence to consider in a criminal trial. Consider the following three propositions:

A – the known facts and testimony could have arisen if the defendant is guilty.
B – the known facts and testimony could have arisen if the defendant is innocent.
C – the defendant is guilty.

Gardner-Medwin argues that the jury should believe both A and not-B in order to convict. A and not-B implies the truth of C, but the reverse is not true. It is possible that B and C are both true, but in this case he argues that a jury should acquit, even though they know that they will be letting some guilty people go free. See also Lindley's paradox.

The O.J. Simpson murder trial


The O.J. Simpson trial is frequently cited as a classic example of the misuse of Bayes' theorem in legal reasoning. Defense attorney Alan Dershowitz argued that since only 0.1% of men who abuse their wives go on to murder them, Simpson's history of abuse was statistically irrelevant. This reasoning is flawed because it ignores the crucial fact that Simpson's wife had in fact been murdered. Once this information is incorporated, Bayesian analysis shows that the probability of the abusive husband being guilty rises dramatically, estimated at around 81%, making the abuse history significant evidence rather than background noise.[57]

Too good to be true: When overwhelming evidence fails to convince


This paper[58] uses a Bayesian probabilistic framework to study a paradox: situations where accumulating unanimous evidence, a seemingly definitive indicator, can decrease confidence in a hypothesis when the assumption of independent observations is compromised by even a tiny risk of systemic failure.[58]

Typically, each independent piece of supporting evidence raises confidence in a hypothesis. However, the inclusion of a small probability of systemic error, a failure mode affecting all observations, alters the implications: excessive agreement may actually imply bias rather than truth. Thus, rather than reinforcing belief, too much consistency can generate suspicion and reduced confidence.[58]

Illustrative examples
Archaeological: The Roman pot

In one scenario, a clay pot's origin is tested (e.g., whether it is from Britain). As repeated tests indicate the same result, confidence initially increases. However, when a small systemic failure rate (e.g., 1% lab contamination) is introduced, the posterior confidence peaks after a few unanimous results and then declines as continued unanimity increasingly suggests contamination rather than independent confirmation. Eventually, confidence can drop close to 50%, equivalent to a random guess, demonstrating that perfect agreement can erode belief.[58]
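The peak-then-decline behaviour is easy to reproduce with a small model; a sketch assuming a test that is 90% accurate when the lab is working and a 1% chance of a systemic failure that always reports "Britain" (the exact rates in the paper may differ):

    prior = 0.5    # P(the pot is from Britain)
    acc = 0.9      # test accuracy when the lab is working correctly
    f = 0.01       # probability of systemic failure (always reports "Britain")

    def posterior_after(n):
        """P(Britain | n independent tests all report 'Britain')."""
        p_data_h = f + (1 - f) * acc ** n            # failure, or n correct hits
        p_data_not_h = f + (1 - f) * (1 - acc) ** n  # failure, or n straight errors
        return (p_data_h * prior /
                (p_data_h * prior + p_data_not_h * (1 - prior)))

    for n in (1, 3, 5, 10, 30):
        print(n, round(posterior_after(n), 4))
    # Confidence rises at first, peaks, then falls back toward 0.5 as
    # unanimity increasingly suggests contamination, not confirmation.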

Legal: Ancient Jewish law (Talmudic example)

The paper references an ancient judicial principle: in Jewish law, a defendant cannot be convicted of a capital crime if all judges unanimously find guilt. This rule, while seemingly counter-intuitive, reflects an intuitive recognition of the paradox: absolute agreement may signal systemic failure rather than reliability.[58]

This aligns with the paper's formal analysis of legal scenarios like witness line-ups, where even a 1% bias causes confidence to decline after approximately three unanimous identifications; further unanimous affirmations make conviction less credible, not more.[58]

The Talmudic rule against unanimous guilty verdicts now finds a formal justification: perfect agreement may be less convincing than moderate, independent agreement.[58]

Cryptography

In cryptographic systems relying on repeated tests (e.g., the Rabin–Miller primality test), hardware or software issues can cause tests to fail systematically. The study shows that ignoring systemic failure can lead to vast underestimation of false-negative rates, by factors as large as 2^80.[58]

Bayesian epistemology


Bayesian epistemology is a movement that advocates for Bayesian inference as a means of justifying the rules of inductive logic.

Karl Popper and David Miller have rejected the idea of Bayesian rationalism, i.e. using Bayes' rule to make epistemological inferences:[59] It is prone to the same vicious circle as any other justificationist epistemology, because it presupposes what it attempts to justify. According to this view, a rational interpretation of Bayesian inference would see it merely as a probabilistic version of falsification, rejecting the belief, commonly held by Bayesians, that high likelihood achieved by a series of Bayesian updates would prove the hypothesis beyond any reasonable doubt, or even with likelihood greater than 0.

Other


Bayes and Bayesian inference


The problem considered by Bayes in Proposition 9 of his essay, "An Essay Towards Solving a Problem in the Doctrine of Chances", is the posterior distribution for the parameter a (the success rate) of the binomial distribution.

History


The term Bayesian refers to Thomas Bayes (1701–1761), who proved that probabilistic limits could be placed on an unknown event.[66] However, it was Pierre-Simon Laplace (1749–1827) who introduced (as Principle VI) what is now called Bayes' theorem and used it to address problems in celestial mechanics, medical statistics, reliability, and jurisprudence.[67] Early Bayesian inference, which used uniform priors following Laplace's principle of insufficient reason, was called "inverse probability" (because it infers backwards from observations to parameters, or from effects to causes[68]). After the 1920s, "inverse probability" was largely supplanted by a collection of methods that came to be called frequentist statistics.[68]

In the 20th century, the ideas of Laplace were further developed in two different directions, giving rise to objective and subjective currents in Bayesian practice. In the objective or "non-informative" current, the statistical analysis depends on only the model assumed, the data analyzed,[69] and the method assigning the prior, which differs from one objective Bayesian practitioner to another. In the subjective or "informative" current, the specification of the prior depends on the belief (that is, propositions on which the analysis is prepared to act), which can summarize information from experts, previous studies, etc.

In the 1980s, there was a dramatic growth in research and applications of Bayesian methods, mostly attributed to the discovery of Markov chain Monte Carlo methods, which removed many of the computational problems, and an increasing interest in nonstandard, complex applications.[70] Despite growth of Bayesian research, most undergraduate teaching is still based on frequentist statistics.[71] Nonetheless, Bayesian methods are widely accepted and used, such as for example in the field of machine learning.[72]

from Grokipedia
Bayesian inference is a method of statistical inference that employs Bayes' theorem to update the probability of a hypothesis or model as new evidence becomes available, by combining prior beliefs with the likelihood of observed data to produce a posterior distribution. This approach treats probabilities as degrees of belief rather than long-run frequencies, allowing for the explicit incorporation of uncertainty and prior knowledge into the inference process.

The foundations of Bayesian inference trace back to the 18th century, with Thomas Bayes, an English mathematician and Presbyterian minister, who developed the core theorem in an essay published posthumously in 1763 by Richard Price. Pierre-Simon Laplace, a French mathematician, independently derived and expanded upon the theorem in the late 1700s, applying it to problems in astronomy, physics, and probability, thereby establishing early applied Bayesian methods such as the normal-normal conjugate model. Although the approach waned in popularity during the early 20th century due to the rise of frequentist statistics, it experienced a revival in the mid-20th century through works on hierarchical modeling and subjective probability, and further advanced in the late 20th and 21st centuries with computational innovations enabling complex nonconjugate models and posterior predictive checking.

At its core, Bayesian inference revolves around three fundamental elements: the prior distribution, which encodes initial beliefs or knowledge about the parameters before observing data; the likelihood function, which quantifies the probability of the data given those parameters; and the posterior distribution, obtained by proportionally multiplying the prior and likelihood via Bayes' theorem. This framework contrasts with frequentist methods, which treat parameters as fixed unknowns and rely solely on data-derived estimates like confidence intervals, whereas Bayesian approaches yield credible intervals that directly interpret the probability of parameter values. Beyond the theorem itself, Bayesian inference incorporates the law of total probability for marginalization over nuisance parameters, enabling robust handling of uncertainties in composite hypotheses and systematic errors.

Bayesian inference has broad applications across disciplines, including psychology for modeling cognitive processes, astronomy for analyzing survey data and inferring cosmic properties, and statistics for hierarchical modeling and model comparison. Its emphasis on probabilistic predictions and uncertainty quantification makes it particularly valuable in fields requiring decision-making under incomplete information, such as machine learning, medicine, and finance.

Fundamentals

Bayes' Theorem

Bayes' theorem is a fundamental result in probability theory that describes how to update the probability of a hypothesis based on new evidence. It is derived from the basic definition of conditional probability. The conditional probability $P(A \mid B)$ of event $A$ given event $B$ (with $P(B) > 0$) is defined as the ratio of the joint probability $P(A \cap B)$ to the marginal probability $P(B)$: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$. Similarly, the reverse conditional probability is $P(B \mid A) = \frac{P(A \cap B)}{P(A)}$, assuming $P(A) > 0$. Equating the two expressions for the joint probability yields $P(A \cap B) = P(A \mid B) P(B) = P(B \mid A) P(A)$, and solving for $P(A \mid B)$ gives $P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$. This is Bayes' theorem, where $P(B)$ in the denominator is the marginal probability of $B$, often computed as $P(B) = \sum_i P(B \mid A_i) P(A_i)$ over a partition of events $\{A_i\}$.

In terms of inference, Bayes' theorem formalizes the process of updating the probability of a hypothesis $H$ in light of evidence $E$, yielding the posterior probability $P(H \mid E)$ as proportional to the product of the prior probability $P(H)$ and the likelihood $P(E \mid H)$, normalized by the total probability of the evidence $P(E)$. This framework enables the revision of initial beliefs about causes or states based on observed effects or data. A useful verbal interpretation of the theorem uses odds ratios. The posterior odds in favor of hypothesis $A$ over alternative $B$ given evidence $D$ are the prior odds $\frac{P(A)}{P(B)}$ multiplied by the likelihood ratio $\frac{P(D \mid A)}{P(D \mid B)}$, which quantifies how much more (or less) likely the evidence is under $A$ than under $B$. If the likelihood ratio exceeds 1, the evidence strengthens support for $A$; if below 1, it weakens it.

The theorem is named after Thomas Bayes (c. 1701–1761), an English mathematician and Presbyterian minister, who formulated it in an essay likely written in the late 1740s but published posthumously in 1763 as "An Essay Towards Solving a Problem in the Doctrine of Chances" in the Philosophical Transactions of the Royal Society, edited by his colleague Richard Price. Independently, the French mathematician Pierre-Simon Laplace rediscovered the result around 1774 and developed its applications in probability theory, with his 1812 treatise giving it wider prominence before Bayes's name was retroactively attached by R. A. Fisher in 1950.

Prior, Likelihood, and Posterior

In Bayesian inference, the prior distribution encodes the initial beliefs or knowledge about the unknown parameters θ before any data are observed. It is a probability distribution assigned to the parameters, which can incorporate expert opinion, historical data, or theoretical considerations. Subjective priors reflect the personal degrees of belief of the analyst, as emphasized in the subjectivist interpretation of probability, where probabilities are coherent previsions that avoid Dutch books. Objective priors, on the other hand, aim to be minimally informative and free from subjective input, such as uniform priors over a bounded parameter space or the Jeffreys prior, which is derived from the Fisher information matrix to ensure invariance under reparameterization.

The likelihood function quantifies the probability of observing the data y given a specific value of the parameters θ, denoted as $p(y \mid \theta)$. It arises from the probabilistic model of the data-generating process and is typically specified based on the assumed sampling distribution, such as a normal or binomial likelihood depending on the nature of the data. Unlike in frequentist statistics, where the likelihood is used to estimate point values of θ, in Bayesian inference it serves to update the prior by weighting parameter values according to how well they explain the observed data.

The posterior distribution represents the updated beliefs about the parameters after incorporating the data, given by Bayes' theorem as $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$. This proportionality holds because the full expression includes a normalizing constant, the marginal likelihood $p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta$, which integrates over all possible parameter values to ensure the posterior is a valid probability distribution. The marginal likelihood, also known as the evidence or model probability, plays a crucial role in comparing different models, as it measures the overall predictive adequacy of the model without conditioning on specific parameters.

Updating Beliefs

In Bayesian inference, the process of updating beliefs begins with a prior distribution that encodes an agent's initial state of knowledge or subjective beliefs about an uncertain parameter or hypothesis. As new information in the form of observed data arrives, this prior is systematically revised to produce a posterior distribution that integrates the evidence from the data, weighted by its likelihood under different possible values of the parameter. This dynamic revision reflects a coherent approach to learning, where beliefs evolve rationally in response to evidence, allowing for the quantification of uncertainty throughout the inference process.

The mathematical basis for this updating is Bayes' theorem, which formalizes the combination of prior beliefs and evidence into updated posteriors. An insightful reformulation expresses the process in terms of odds ratios: the posterior odds in favor of one hypothesis over another equal the prior odds multiplied by the Bayes factor, a quantity that captures solely the evidential impact of the data by comparing the likelihoods under the competing hypotheses. This odds-based view, pioneered by Harold Jeffreys, separates the roles of initial beliefs and data-driven evidence, facilitating the assessment of how strongly observations support or refute particular models.

While Bayesian updating relies on probabilistic priors and likelihoods, alternative frameworks offer contrasting approaches to belief revision. Logical probability methods, as developed by Rudolf Carnap, derive degrees of confirmation from the structural similarities between evidence and hypotheses using purely logical principles, eschewing subjective priors in favor of objective inductive rules. In a different vein, the Dempster-Shafer theory extends beyond additive probabilities by employing belief functions that distribute mass over subsets of hypotheses, enabling the representation of both uncertainty and ignorance without committing to precise point probabilities; this allows for more flexible combination of evidence sources compared to strict Bayesian conditioning. These alternatives highlight limitations in Bayesian methods, such as sensitivity to prior specification, but often sacrifice the full coherence and normalization properties of probability.

A fundamental guideline for effective Bayesian updating is Cromwell's rule, which cautions against assigning prior probabilities of exactly zero to logically possible events or one to logically impossible ones, as such extremes can immunize beliefs against contradictory evidence: for example, a zero prior ensures the posterior remains zero irrespective of evidential strength. Articulated by Dennis Lindley and inspired by Oliver Cromwell's plea to "think it possible you may be mistaken," this rule promotes priors that remain responsive to data, fostering robust inference even under incomplete initial knowledge.

Bayesian Updating

Single Observation

In Bayesian inference, updating the belief about a parameter θ upon observing a single data point x follows directly from Bayes' theorem, yielding the posterior distribution $p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)$, where $p(\theta)$ denotes the prior distribution and $p(x \mid \theta)$ the likelihood. The symbol ∝ indicates proportionality, as the posterior is the unnormalized product of the likelihood and prior; to obtain the proper probability distribution, it must be scaled by the marginal likelihood (or evidence) $p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$ for continuous θ, ensuring the posterior integrates to 1.

This framework is particularly straightforward when θ represents discrete hypotheses that are mutually exclusive and exhaustive, such as a finite set $\{\theta_1, \dots, \theta_k\}$. In this case, the posterior probability for each hypothesis is $P(\theta_i \mid x) = \frac{P(x \mid \theta_i) P(\theta_i)}{\sum_{j=1}^k P(x \mid \theta_j) P(\theta_j)}$, where the denominator serves as the normalizing constant, explicitly computable as the sum of the joint probabilities over all hypotheses. For simple cases with few hypotheses, such as binary outcomes (e.g., two competing explanations), this normalization is direct: if the prior odds are $P(\theta_1)/P(\theta_2)$ and the likelihood ratio is $P(x \mid \theta_1)/P(x \mid \theta_2)$, the posterior odds become their product, with the marginal $P(x)$ following as $P(x \mid \theta_1) P(\theta_1) + P(x \mid \theta_2) P(\theta_2)$.

To illustrate, consider updating the prior probability of rain tomorrow (0.1) based on a single weather reading, such as a cloudy morning, where the likelihood of clouds given rain is 0.8 and the marginal probability of clouds is 0.4; the posterior probability of rain then shifts upward to 0.2 to reflect this evidence, computed via the discrete formula above. Such single-observation updates form the foundation for incorporating additional data through repeated application of Bayes' theorem.
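The weather illustration above, computed directly:

    p_rain = 0.1           # prior P(rain)
    p_clouds_rain = 0.8    # likelihood P(clouds | rain)
    p_clouds = 0.4         # marginal P(clouds)

    print(p_clouds_rain * p_rain / p_clouds)   # posterior P(rain | clouds) = 0.2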

Multiple Observations

In Bayesian inference, the framework for incorporating multiple observations extends the single-observation case by combining evidence from several data points to update the prior distribution on the parameter θ. For n independent and identically distributed (i.i.d.) observations $x_1, \dots, x_n$, the posterior distribution is given by $p(\theta \mid x_1, \dots, x_n) \propto \left[ \prod_{i=1}^n p(x_i \mid \theta) \right] p(\theta)$, where the likelihood term factors into a product due to the i.i.d. assumption, reflecting how each observation contributes multiplicatively to the evidence for θ. This formulation scales the single-observation update, where the posterior is proportional to the prior times one likelihood, to a batch of data, enabling efficient incorporation of accumulated evidence.

The i.i.d. assumption, namely that the observations are independent conditional on θ, simplifies the joint likelihood to the product form, making analytical or computational inference tractable in many models, such as those from the exponential family. This conditional independence is a modeling choice, often justified by the data-generating process, but it can be relaxed when observations exhibit dependence; in such cases, the full joint likelihood $p(x_1, \dots, x_n \mid \theta)$ is used instead of the product, which may require specifying covariance structures or hierarchical models to capture correlations. For example, in time-series data, autoregressive components can model temporal dependence while still applying Bayes' theorem to the joint distribution.

The marginal likelihood for the multiple observations, which normalizes the posterior, is $p(x_1, \dots, x_n) = \int \left[ \prod_{i=1}^n p(x_i \mid \theta) \right] p(\theta)\, d\theta$ under the i.i.d. assumption, representing the predictive probability of the data averaged over the prior. This quantity, also known as the evidence, plays a key role in model comparison via Bayes factors but can be challenging to compute exactly, often approximated using simulation methods like Markov chain Monte Carlo.

When accumulating data from multiple sources or repeated experiments, the batch posterior formula allows direct computation using the full product of likelihoods and the initial prior, avoiding the need to iteratively re-derive intermediate posteriors for subsets of the data. This approach is particularly advantageous in large datasets, where the evidence from all observations is combined proportionally without stepwise adjustments, preserving the coherence of belief updating while scaling to practical applications in fields like machine learning or epidemiology.

Sequential Updating

Sequential updating in Bayesian inference involves iteratively refining the posterior distribution as new observations arrive over time, enabling a dynamic incorporation of evidence. The core mechanism is the recursive application of Bayes' theorem, where the posterior at time t, $p(\theta \mid y_{1:t})$, is proportional to the likelihood of the new observation $y_t$ given the parameter θ, multiplied by the posterior from the previous step $p(\theta \mid y_{1:t-1})$. Formally, $p(\theta \mid y_{1:t}) \propto p(y_t \mid \theta, y_{1:t-1}) \cdot p(\theta \mid y_{1:t-1})$, assuming the observations are conditionally independent given θ. This form treats the previous posterior as the prior for the current update, allowing beliefs to evolve incrementally without recomputing from the initial prior each time. For independent and identically distributed observations, this sequential process yields the same result as a single batch update using all data at once.

The advantages of this recursive approach are pronounced in online learning environments, where data streams continuously and computational efficiency is paramount, as it avoids the need to store or reprocess the entire history. It supports real-time decision-making by providing updated inferences after each new observation, which is essential for adaptive algorithms that respond to evolving conditions. Additionally, sequential updating is well-suited to dynamic models, where parameters or states change over time, facilitating the tracking of temporal variations through successive refinements of the posterior. These benefits have been demonstrated in large-scale applications, such as cognitive modeling with high-velocity data streams, where incremental updates preserve inferential accuracy while managing memory constraints.

A conceptual example arises in time series filtering, where sequential updating estimates latent states underlying observed data, such as inferring a system's hidden state from noisy sequential measurements. At each time step, the current posterior, representing beliefs about the state, serves as the prior, which is then updated with the new observation's likelihood to produce a sharper estimate, progressively reducing uncertainty as more evidence accumulates. This mirrors Bayesian updating in sequential data contexts, emphasizing how each update builds on prior knowledge to form a coherent evolving picture.

Despite these strengths, sequential updating presents challenges, particularly in eliciting an appropriate initial prior for long sequences of observations. The choice of starting prior can influence early updates disproportionately if data is sparse initially, and even as subsequent data dominates, misspecification may introduce subtle biases that propagate through the chain. Careful expert elicitation is thus crucial to ensure the prior reflects genuine uncertainty without unduly skewing long-term posteriors, a task that requires structured methods to aggregate expert judgments reliably.

Formal Framework

Definitions and Notation

In the Bayesian framework for parametric statistical models, the unknown parameters are elements θ of a parameter space Θ, typically a subset of ℝᵖ for some p, while the observed data consist of realizations x from an observable space X, which may be discrete, continuous, or mixed. The prior distribution encodes initial uncertainty about θ via a probability measure π on Θ, which in the continuous case is specified by a density π(θ) with respect to a dominating measure (such as Lebesgue measure), and in the discrete case by a probability mass function. The likelihood is the probability measure of x given θ, denoted f(x|θ), which serves as the density or mass function of the sampling distribution x ~ f(·|θ).

Distinctions between densities and probabilities arise depending on the nature of the spaces: for continuous X and Θ, π(θ) and f(x|θ) are probability density functions, integrating to 1 over their respective spaces, whereas for discrete cases they are probability mass functions summing to 1. In scenarios involving point masses, such as degenerate priors or discrete components in mixed distributions, the Dirac delta δ_τ(θ) represents a unit point mass at a specific value τ ∈ Θ, defined such that for any function g continuous at τ, ∫ g(θ) δ_τ(θ) dθ = g(τ).

The posterior distribution π(θ|x) then combines the prior and likelihood to reflect updated beliefs about θ after observing x, with Bayes' theorem providing the linkage in the form π(θ|x) ∝ f(x|θ) π(θ). This general setup underpins Bayesian inference in parametric models, where Θ parameterizes the family of distributions {f(·|θ) : θ ∈ Θ}.

Posterior Distribution

In Bayesian inference, the posterior distribution represents the updated state of knowledge about the unknown parameters θ after observing the data x, synthesizing prior beliefs with the information provided by the likelihood. This distribution, denoted $\pi(\theta \mid x)$, quantifies the relative plausibility of different values of θ conditional on x, serving as the foundation for all parameter-focused inferences such as estimating θ or assessing its uncertainty.

The posterior is formally derived from Bayes' theorem, which states that the joint density of θ and x factors as $p(\theta, x) = f(x \mid \theta)\, \pi(\theta)$, where $f(x \mid \theta)$ is the likelihood and $\pi(\theta)$ is the prior distribution. The posterior then follows as the conditional density: $\pi(\theta \mid x) = \frac{f(x \mid \theta)\, \pi(\theta)}{m(x)}$, with the marginal likelihood $m(x) = \int f(x \mid \theta)\, \pi(\theta)\, d\theta$ acting as the normalizing constant to ensure $\pi(\theta \mid x)$ integrates to 1 over θ. This update rule, originally proposed by Thomas Bayes, proportionally weights the prior by the likelihood and normalizes to produce a proper probability distribution.

Bayesian posteriors can be parametric or non-parametric, differing in the dimensionality and flexibility of the parameter space. Parametric posteriors assume θ lies in a finite-dimensional space, constraining the form of the distribution (e.g., a normal likelihood with unknown mean yielding a normal posterior under a normal prior), which facilitates computation but may impose overly rigid assumptions on the data-generating process. In contrast, non-parametric posteriors operate over infinite-dimensional spaces, such as distributions indexed by functions or measures (e.g., via Dirichlet process priors), enabling adaptive modeling of complex, unspecified structures while maintaining coherent uncertainty quantification.

The posterior's role in inference centers on its use to draw conclusions about θ given x, such as computing expectations $\mathbb{E}[\theta \mid x]$ for point summaries or integrating over it for decision-making under uncertainty, thereby providing a complete probabilistic framework for parameter estimation and hypothesis evaluation.

Predictive Distribution

In Bayesian inference, the predictive distribution for new, unobserved data x* given observed data x is obtained by integrating the likelihood of the new data over the posterior distribution of the parameters θ. This is known as the posterior predictive distribution, formally expressed as

p(x*|x) = ∫ p(x*|θ) π(θ|x) dθ,

where p(x*|θ) is the sampling distribution (likelihood) for the new data and π(θ|x) is the posterior density of the parameters. This formulation marginalizes over the uncertainty in θ, providing a full probabilistic description of future observations that accounts for both data variability and parameter estimation error.

The computation of the posterior predictive distribution involves marginalization, which integrates the parameters out of the joint posterior predictive density p(x*, θ|x) = p(x*|θ) π(θ|x). In practice, this integral is rarely tractable analytically and is typically approximated by simulation: drawing samples θ^(s) from the posterior π(θ|x) and then generating replicated data x*^(s) ~ p(x*|θ^(s)) for s = 1, …, S yields an empirical approximation to the distribution. These simulations enable the estimation of predictive quantities such as means, variances, or quantiles directly from the sample of x*^(s).

Unlike frequentist plug-in predictions, which substitute a point estimate (e.g., the maximum likelihood estimate) for θ in the likelihood to obtain a predictive distribution p(x*|θ̂), the Bayesian posterior predictive averages over the entire posterior, incorporating parameter uncertainty and potentially prior information. This leads to wider predictive intervals in small samples and better calibration for forecasting, since the plug-in approach underestimates variability by treating θ̂ as fixed.

The posterior predictive distribution is central to forecasting new data in applications such as election outcomes or environmental modeling, where it generates probabilistic predictions by propagating parameter uncertainty forward. It also facilitates model checking through posterior predictive checks, which compare observed data to replicates simulated from the posterior predictive distribution to assess fit, for example by evaluating discrepancies via test statistics such as means or extremes.
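The two-step simulation recipe above can be sketched in a few lines of Python. The Beta-Binomial model, sample sizes, and random seed below are illustrative assumptions, not part of the source; the same pattern applies to any model from which posterior draws are available.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed model: k = 7 successes in n = 10 trials, prior theta ~ Beta(1, 1),
# so the conjugate posterior is Beta(1 + k, 1 + n - k).
n, k = 10, 7
alpha_post, beta_post = 1 + k, 1 + (n - k)

# Step 1: draw S posterior samples theta^(s) ~ pi(theta | x).
S = 10_000
theta_s = rng.beta(alpha_post, beta_post, size=S)

# Step 2: for each draw, simulate replicated data x*^(s) ~ p(x* | theta^(s)),
# here the number of successes in n_new future trials.
n_new = 10
x_star = rng.binomial(n_new, theta_s)        # posterior predictive sample

# Predictive summaries computed directly from the replicates.
print("posterior predictive mean:", x_star.mean())
print("90% predictive interval:", np.quantile(x_star, [0.05, 0.95]))

# Frequentist plug-in comparison: fix theta at the MLE k/n. The plug-in
# spread is narrower because it ignores uncertainty in theta.
x_plugin = rng.binomial(n_new, k / n, size=S)
print("plug-in std:", x_plugin.std(), " vs Bayesian std:", x_star.std())
```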

Mathematical Properties

Marginalization and Conditioning

In Bayesian inference, marginalization is the process of obtaining the marginal distribution of a subset of variables by integrating the others out of their joint distribution, effectively averaging over the uncertainty in the excluded variables. This operation is essential for focusing on quantities of interest while treating others as nuisance parameters. For instance, the marginal likelihood, also known as the evidence, for observed data x under a model parameterized by θ is given by

m(x) = ∫ f(x|θ) π(θ) dθ,

where f(x|θ) is the sampling distribution or likelihood of the data given the parameters, and π(θ) is the prior distribution on θ. This integral represents the predictive probability of the data under the prior model and serves as the normalizing constant in Bayes' theorem.

The law of total probability provides the foundational justification for marginalization in the Bayesian context, stating that the unconditional density of a variable is the expected value of its conditional density with respect to the marginal density of the conditioning variables. In continuous form, this is

p(x) = ∫ p(x|θ) p(θ) dθ,

which directly corresponds to the evidence computation and extends naturally to discrete cases via summation. By performing marginalization, Bayesian analyses can reduce the dimensionality of high-dimensional parameter spaces, making inference more tractable and interpretable without losing the uncertainty encoded in the integrated variables.

Conditioning complements marginalization by restricting probabilities to scenarios consistent with observed or specified conditions, thereby updating beliefs about the remaining uncertainties. In Bayesian inference, conditioning on x transforms the prior π(θ) into the posterior π(θ|x) via

π(θ|x) = f(x|θ) π(θ) / m(x),

where the denominator is the marginalized evidence. The same operation can also be applied to subsets of the parameters or to auxiliary variables, allowing targeted updates that incorporate partial information. Together, marginalization and conditioning enable the decomposition of complex joint distributions into manageable components, facilitating modular model building and precise probabilistic reasoning in Bayesian models.
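A minimal numerical sketch of both operations, assuming a normal model with unknown mean μ and standard deviation σ evaluated on a 2-D grid (the simulated data, flat priors, and grid choices below are all illustrative assumptions): marginalizing sums σ out of the joint posterior, while conditioning slices the grid at a fixed σ and renormalizes.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.5, size=30)             # simulated data (assumption)

mu = np.linspace(0, 4, 200)
sigma = np.linspace(0.5, 3.5, 200)
MU, SIG = np.meshgrid(mu, sigma, indexing="ij")

# Log joint posterior with flat priors over the grid (illustrative choice):
# log f(x | mu, sigma) = -n log(sigma) - sum_i (x_i - mu)^2 / (2 sigma^2).
loglik = -x.size * np.log(SIG) - ((x[:, None, None] - MU) ** 2).sum(0) / (2 * SIG**2)
joint = np.exp(loglik - loglik.max())
joint /= joint.sum()                          # normalize over the grid

# Marginalization: integrate sigma out to get pi(mu | x).
marg_mu = joint.sum(axis=1)

# Conditioning: fix sigma at a grid value and renormalize the slice,
# giving pi(mu | x, sigma = 1.5).
j = np.argmin(np.abs(sigma - 1.5))
cond_mu = joint[:, j] / joint[:, j].sum()

print("marginal posterior mean of mu:", (mu * marg_mu).sum())
print("conditional posterior mean of mu:", (mu * cond_mu).sum())
```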

Conjugate Priors

In Bayesian inference, a conjugate prior is defined as a family of distributions for which the posterior distribution belongs to the same family after updating with data from a specified likelihood. This property ensures that the posterior can be obtained by simply updating the parameters of the prior, without any change in distributional form. The concept is particularly useful for likelihoods in the exponential family, where conjugate priors can be constructed to match the sufficient statistics of the likelihood.

A classic example is the Beta-Binomial model, where the parameter θ of a Binomial likelihood represents the success probability. The prior is taken as θ ~ Beta(α, β), with density proportional to θ^(α−1)(1−θ)^(β−1). For n independent trials yielding k successes, the posterior is θ | data ~ Beta(α + k, β + n − k). This update interprets α and β as pseudocounts of prior successes and failures, respectively.

Another prominent case is the Normal-Normal conjugate pair, applicable when estimating the mean μ of a normal distribution with known variance σ². The prior is μ ~ N(μ₀, σ₀²). Given n i.i.d. observations x₁, …, xₙ ~ N(μ, σ²) with sample mean x̄, the posterior is

μ | data ~ N( (n/σ² · x̄ + 1/σ₀² · μ₀) / (n/σ² + 1/σ₀²), 1 / (n/σ² + 1/σ₀²) ).

The posterior mean is a precision-weighted average of the prior mean and the sample mean, while the posterior variance is reduced relative to both.

For count data, the Gamma-Poisson model provides conjugacy: the Poisson rate λ has prior λ ~ Gamma(α, β), with density proportional to λ^(α−1) e^(−βλ). For n i.i.d. Poisson observations summing to s = Σ xᵢ, the posterior is λ | data ~ Gamma(α + s, β + n). Here, α and β act as prior shape and rate parameters, updated by the total count and the exposure.

The primary advantage of conjugate priors lies in their analytical tractability: posteriors, marginal likelihoods, and predictive distributions can often be derived in closed form, avoiding numerical integration and enabling efficient sequential updating in dynamic models. This is especially beneficial for evidence calculation via marginalization, where the marginal likelihood is straightforward to compute. However, conjugate families restrict the form of prior beliefs, potentially limiting flexibility in capturing complex or data-driven uncertainties, which may call for sensitivity analyses to assess robustness.
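The three conjugate updates above reduce to one-line parameter arithmetic. The following Python functions (names and example values are illustrative, not a standard API) implement them directly:

```python
# Minimal sketches of the three conjugate updates described above.

def beta_binomial_update(alpha, beta, k, n):
    """Beta(alpha, beta) prior; k successes observed in n Bernoulli trials."""
    return alpha + k, beta + n - k

def normal_normal_update(mu0, sigma0_sq, xbar, sigma_sq, n):
    """N(mu0, sigma0_sq) prior for the mean of N(mu, sigma_sq), sigma_sq known."""
    prec = n / sigma_sq + 1 / sigma0_sq                 # posterior precision
    mu_post = (n / sigma_sq * xbar + mu0 / sigma0_sq) / prec
    return mu_post, 1 / prec                            # posterior mean, variance

def gamma_poisson_update(alpha, beta, s, n):
    """Gamma(alpha, beta) prior for a Poisson rate; s = sum of n counts."""
    return alpha + s, beta + n

# Pseudocount interpretation: Beta(2, 2) prior plus 7 successes, 3 failures.
print(beta_binomial_update(2, 2, k=7, n=10))            # -> (9, 5)
print(normal_normal_update(0.0, 1.0, xbar=1.2, sigma_sq=4.0, n=25))
print(gamma_poisson_update(1.0, 1.0, s=42, n=12))
```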

Asymptotic Behavior

As the sample size n increases, the Bayesian posterior distribution exhibits desirable asymptotic properties under suitable regularity conditions, ensuring that inference becomes increasingly reliable. A fundamental result is the consistency of the posterior: the posterior probability concentrates on the true parameter value θ₀ with respect to the data-generating measure, provided the model is well-specified and the prior assigns positive mass to neighborhoods of θ₀. This property, first established by Doob, implies that the posterior mean and other summaries converge to θ₀, justifying the use of Bayesian methods for large datasets.

Under additional smoothness and regularity assumptions, the Bernstein-von Mises theorem provides a more precise characterization: the posterior distribution π(θ|y) is asymptotically approximated by a normal distribution centered at the maximum likelihood estimator θ̂_n, with covariance given by the inverse observed information scaled by n, i.e. n⁻¹ I_n(θ̂_n)⁻¹. Formally, the total variation distance between the posterior and this normal approximation converges to zero in probability:

‖ π(·|y) − N(θ̂_n, n⁻¹ I_n(θ̂_n)⁻¹) ‖_TV → 0 as n → ∞.

A practical consequence is that, in large samples, Bayesian credible intervals approximately coincide with frequentist confidence intervals, and the influence of the prior vanishes relative to the information in the data.
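A numerical sketch of this convergence (illustrative, not from the source), using a Bernoulli model with a uniform prior and data fixed at the expected count: the exact Beta posterior is compared on a grid against the normal approximation centered at the MLE.

```python
import numpy as np
from scipy import stats

# Bernstein-von Mises illustration: total variation distance between the
# exact Beta posterior and its normal approximation shrinks as n grows.
theta0 = 0.3                                   # assumed true parameter
grid = np.linspace(1e-4, 1 - 1e-4, 4000)
d = grid[1] - grid[0]

for n in (10, 100, 1000, 10000):
    k = round(n * theta0)                      # data fixed at the expected count
    mle = k / n
    # Exact posterior under a uniform prior: Beta(k + 1, n - k + 1).
    post = stats.beta.pdf(grid, k + 1, n - k + 1)
    # Normal approximation; inverse observed information is mle(1 - mle)/n.
    approx = stats.norm.pdf(grid, mle, np.sqrt(mle * (1 - mle) / n))
    tv = 0.5 * np.abs(post - approx).sum() * d # total variation distance
    print(f"n = {n:>6}: TV distance ~ {tv:.4f}")
```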