Bayesian linear regression
from Wikipedia

Bayesian linear regression is a type of conditional modeling in which the mean of one variable is described by a linear combination of other variables, with the goal of obtaining the posterior probability of the regression coefficients (as well as other parameters describing the distribution of the regressand) and ultimately allowing the out-of-sample prediction of the regressand (often labelled $y$) conditional on observed values of the regressors (usually $X$). The simplest and most widely used version of this model is the normal linear model, in which $y$ given $X$ is distributed Gaussian. In this model, and under a particular choice of prior probabilities for the parameters—so-called conjugate priors—the posterior can be found analytically. With more arbitrarily chosen priors, the posteriors generally have to be approximated.

Model setup

Consider a standard linear regression problem, in which for $i = 1, \ldots, n$ we specify the mean of the conditional distribution of $y_i$ given a $k \times 1$ predictor vector $\mathbf{x}_i$:

$y_i = \mathbf{x}_i^{\mathsf{T}} \boldsymbol{\beta} + \varepsilon_i,$

where $\boldsymbol{\beta}$ is a $k \times 1$ vector, and the $\varepsilon_i$ are independent and identically normally distributed random variables:

$\varepsilon_i \sim N(0, \sigma^2).$

This corresponds to the following likelihood function:

$\rho(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^{2}) \propto (\sigma^{2})^{-n/2} \exp\left(-\frac{1}{2\sigma^{2}} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\mathsf{T}} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\right).$

The ordinary least squares solution is used to estimate the coefficient vector using the Moore–Penrose pseudoinverse:

$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1} \mathbf{X}^{\mathsf{T}} \mathbf{y},$

where $\mathbf{X}$ is the $n \times k$ design matrix, each row of which is a predictor vector $\mathbf{x}_i^{\mathsf{T}}$; and $\mathbf{y}$ is the column $n$-vector $[y_1 \; \cdots \; y_n]^{\mathsf{T}}$.
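
A minimal NumPy sketch of this estimator follows; the synthetic data and dimensions here are illustrative, not taken from the text.

python

import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # design matrix with intercept column
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# OLS estimate via the Moore-Penrose pseudoinverse
beta_hat = np.linalg.pinv(X) @ y
# Numerically preferred equivalent: a least-squares solve
beta_hat_alt, *_ = np.linalg.lstsq(X, y, rcond=None)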

This is a frequentist approach, and it assumes that there are enough measurements to say something meaningful about $\boldsymbol{\beta}$. In the Bayesian approach,[1] the data are supplemented with additional information in the form of a prior probability distribution. The prior belief about the parameters is combined with the data's likelihood function according to Bayes' theorem to yield the posterior belief about the parameters $\boldsymbol{\beta}$ and $\sigma$. The prior can take different functional forms depending on the domain and the information that is available a priori.

Since the data comprise both $\mathbf{y}$ and $\mathbf{X}$, the focus only on the distribution of $\mathbf{y}$ conditional on $\mathbf{X}$ needs justification. In fact, a "full" Bayesian analysis would require a joint likelihood $\rho(\mathbf{y}, \mathbf{X} \mid \boldsymbol{\beta}, \sigma^{2}, \gamma)$ along with a prior $\rho(\boldsymbol{\beta}, \sigma^{2}, \gamma)$, where $\gamma$ symbolizes the parameters of the distribution for $\mathbf{X}$. Only under the assumption of (weak) exogeneity can the joint likelihood be factored into $\rho(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^{2})\, \rho(\mathbf{X} \mid \gamma)$.[2] The latter part is usually ignored under the assumption of disjoint parameter sets. More so, under classic assumptions $\mathbf{X}$ is considered chosen (for example, in a designed experiment) and therefore has a known probability without parameters.[3]

With conjugate priors

Conjugate prior distribution

For an arbitrary prior distribution, there may be no analytical solution for the posterior distribution. In this section, we will consider a so-called conjugate prior for which the posterior distribution can be derived analytically.

A prior $\rho(\boldsymbol{\beta}, \sigma^{2})$ is conjugate to this likelihood function if it has the same functional form with respect to $\boldsymbol{\beta}$ and $\sigma$. Since the log-likelihood is quadratic in $\boldsymbol{\beta}$, the log-likelihood is re-written such that the likelihood becomes normal in $(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})$. Write

$(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\mathsf{T}}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{\mathsf{T}}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) + (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^{\mathsf{T}} \mathbf{X}^{\mathsf{T}}\mathbf{X}\, (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}).$

The likelihood is now re-written as

$\rho(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^{2}) \propto (\sigma^{2})^{-\frac{v}{2}} \exp\left(-\frac{v s^{2}}{2\sigma^{2}}\right) (\sigma^{2})^{-\frac{n-v}{2}} \exp\left(-\frac{1}{2\sigma^{2}} (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^{\mathsf{T}} \mathbf{X}^{\mathsf{T}}\mathbf{X}\, (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})\right),$

where $v s^{2} = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{\mathsf{T}}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})$ and $v = n - k$, with $k$ the number of regression coefficients.

This suggests a form for the prior:

$\rho(\boldsymbol{\beta}, \sigma^{2}) = \rho(\sigma^{2})\, \rho(\boldsymbol{\beta} \mid \sigma^{2}),$

where $\rho(\sigma^{2})$ is an inverse-gamma distribution

$\rho(\sigma^{2}) \propto (\sigma^{2})^{-\frac{v_0}{2} - 1} \exp\left(-\frac{v_0 s_0^{2}}{2\sigma^{2}}\right).$

In the notation introduced in the inverse-gamma distribution article, this is the density of an $\text{Inv-Gamma}(a_0, b_0)$ distribution with $a_0 = \tfrac{v_0}{2}$ and $b_0 = \tfrac{1}{2} v_0 s_0^{2}$, with $v_0$ and $s_0^{2}$ as the prior values of $v$ and $s^{2}$, respectively. Equivalently, it can also be described as a scaled inverse chi-squared distribution, $\text{Scale-inv-}\chi^{2}(v_0, s_0^{2})$.

Further, the conditional prior density $\rho(\boldsymbol{\beta} \mid \sigma^{2})$ is a normal distribution,

$\rho(\boldsymbol{\beta} \mid \sigma^{2}) \propto (\sigma^{2})^{-\frac{k}{2}} \exp\left(-\frac{1}{2\sigma^{2}} (\boldsymbol{\beta} - \boldsymbol{\mu}_0)^{\mathsf{T}} \boldsymbol{\Lambda}_0 (\boldsymbol{\beta} - \boldsymbol{\mu}_0)\right).$

In the notation of the normal distribution, the conditional prior distribution is $\mathcal{N}(\boldsymbol{\mu}_0, \sigma^{2} \boldsymbol{\Lambda}_0^{-1})$.

Posterior distribution

With the prior now specified, the posterior distribution can be expressed as

$\rho(\boldsymbol{\beta}, \sigma^{2} \mid \mathbf{y}, \mathbf{X}) \propto \rho(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^{2})\, \rho(\boldsymbol{\beta} \mid \sigma^{2})\, \rho(\sigma^{2}).$

With some re-arrangement,[4] the posterior can be re-written so that the posterior mean $\boldsymbol{\mu}_n$ of the parameter vector $\boldsymbol{\beta}$ can be expressed in terms of the least squares estimator $\hat{\boldsymbol{\beta}}$ and the prior mean $\boldsymbol{\mu}_0$, with the strength of the prior indicated by the prior precision matrix $\boldsymbol{\Lambda}_0$:

$\boldsymbol{\mu}_n = (\mathbf{X}^{\mathsf{T}}\mathbf{X} + \boldsymbol{\Lambda}_0)^{-1} (\mathbf{X}^{\mathsf{T}}\mathbf{X}\, \hat{\boldsymbol{\beta}} + \boldsymbol{\Lambda}_0\, \boldsymbol{\mu}_0).$

To justify that $\boldsymbol{\mu}_n$ is indeed the posterior mean, the quadratic terms in the exponential can be re-arranged as a quadratic form in $\boldsymbol{\beta} - \boldsymbol{\mu}_n$:[5]

$(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\mathsf{T}}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + (\boldsymbol{\beta} - \boldsymbol{\mu}_0)^{\mathsf{T}} \boldsymbol{\Lambda}_0 (\boldsymbol{\beta} - \boldsymbol{\mu}_0) = (\boldsymbol{\beta} - \boldsymbol{\mu}_n)^{\mathsf{T}} (\mathbf{X}^{\mathsf{T}}\mathbf{X} + \boldsymbol{\Lambda}_0) (\boldsymbol{\beta} - \boldsymbol{\mu}_n) + \mathbf{y}^{\mathsf{T}}\mathbf{y} - \boldsymbol{\mu}_n^{\mathsf{T}} (\mathbf{X}^{\mathsf{T}}\mathbf{X} + \boldsymbol{\Lambda}_0)\, \boldsymbol{\mu}_n + \boldsymbol{\mu}_0^{\mathsf{T}} \boldsymbol{\Lambda}_0\, \boldsymbol{\mu}_0.$

Now the posterior can be expressed as a normal distribution times an inverse-gamma distribution:

$\rho(\boldsymbol{\beta}, \sigma^{2} \mid \mathbf{y}, \mathbf{X}) \propto (\sigma^{2})^{-\frac{k}{2}} \exp\left(-\frac{1}{2\sigma^{2}} (\boldsymbol{\beta} - \boldsymbol{\mu}_n)^{\mathsf{T}} (\mathbf{X}^{\mathsf{T}}\mathbf{X} + \boldsymbol{\Lambda}_0) (\boldsymbol{\beta} - \boldsymbol{\mu}_n)\right) (\sigma^{2})^{-\frac{n + v_0}{2} - 1} \exp\left(-\frac{2 b_0 + \mathbf{y}^{\mathsf{T}}\mathbf{y} - \boldsymbol{\mu}_n^{\mathsf{T}} (\mathbf{X}^{\mathsf{T}}\mathbf{X} + \boldsymbol{\Lambda}_0)\, \boldsymbol{\mu}_n + \boldsymbol{\mu}_0^{\mathsf{T}} \boldsymbol{\Lambda}_0\, \boldsymbol{\mu}_0}{2\sigma^{2}}\right).$

Therefore, the posterior distribution can be parametrized as follows:

$\rho(\boldsymbol{\beta}, \sigma^{2} \mid \mathbf{y}, \mathbf{X}) \propto \rho(\boldsymbol{\beta} \mid \sigma^{2}, \mathbf{y}, \mathbf{X})\, \rho(\sigma^{2} \mid \mathbf{y}, \mathbf{X}),$

where the two factors correspond to the densities of $\mathcal{N}(\boldsymbol{\mu}_n, \sigma^{2} \boldsymbol{\Lambda}_n^{-1})$ and $\text{Inv-Gamma}(a_n, b_n)$ distributions, with the parameters of these given by

$\boldsymbol{\Lambda}_n = \mathbf{X}^{\mathsf{T}}\mathbf{X} + \boldsymbol{\Lambda}_0, \quad \boldsymbol{\mu}_n = \boldsymbol{\Lambda}_n^{-1} (\boldsymbol{\Lambda}_0 \boldsymbol{\mu}_0 + \mathbf{X}^{\mathsf{T}}\mathbf{y}), \quad a_n = a_0 + \frac{n}{2}, \quad b_n = b_0 + \frac{1}{2} \left(\mathbf{y}^{\mathsf{T}}\mathbf{y} + \boldsymbol{\mu}_0^{\mathsf{T}} \boldsymbol{\Lambda}_0\, \boldsymbol{\mu}_0 - \boldsymbol{\mu}_n^{\mathsf{T}} \boldsymbol{\Lambda}_n\, \boldsymbol{\mu}_n\right),$

which illustrates Bayesian inference being a compromise between the information contained in the prior and the information contained in the sample.
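
As a sketch of how these update formulas might be coded, the following function computes $(\boldsymbol{\mu}_n, \boldsymbol{\Lambda}_n, a_n, b_n)$ from data and prior hyperparameters; the function name and argument conventions are illustrative rather than taken from any particular library.

python

import numpy as np

def nig_posterior_update(X, y, mu0, Lambda0, a0, b0):
    """Return the posterior hyperparameters (mu_n, Lambda_n, a_n, b_n)
    of the conjugate normal-inverse-gamma model, per the formulas above."""
    n = len(y)
    Lambda_n = X.T @ X + Lambda0
    mu_n = np.linalg.solve(Lambda_n, Lambda0 @ mu0 + X.T @ y)
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * (y @ y + mu0 @ Lambda0 @ mu0 - mu_n @ Lambda_n @ mu_n)
    return mu_n, Lambda_n, a_n, b_n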

Model evidence

The model evidence $p(\mathbf{y} \mid m)$ is the probability of the data given the model $m$. It is also known as the marginal likelihood, and as the prior predictive density. Here, the model is defined by the likelihood function $p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^{2})$ and the prior distribution on the parameters, i.e. $p(\boldsymbol{\beta}, \sigma^{2})$. The model evidence captures in a single number how well such a model explains the observations. The model evidence of the Bayesian linear regression model presented in this section can be used to compare competing linear models by Bayes factors. These models may differ in the number and values of the predictor variables as well as in their priors on the model parameters. Model complexity is already taken into account by the model evidence, because it marginalizes out the parameters by integrating over all possible values of $\boldsymbol{\beta}$ and $\sigma^{2}$:

$p(\mathbf{y} \mid m) = \int p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^{2})\, p(\boldsymbol{\beta}, \sigma^{2})\, d\boldsymbol{\beta}\, d\sigma^{2}.$

This integral can be computed analytically and the solution is given in the following equation.[6]

$p(\mathbf{y} \mid m) = \frac{1}{(2\pi)^{n/2}} \sqrt{\frac{\det(\boldsymbol{\Lambda}_0)}{\det(\boldsymbol{\Lambda}_n)}} \cdot \frac{b_0^{a_0}}{b_n^{a_n}} \cdot \frac{\Gamma(a_n)}{\Gamma(a_0)}.$

Here $\Gamma$ denotes the gamma function. Because we have chosen a conjugate prior, the marginal likelihood can also be easily computed by evaluating the following equality for arbitrary values of $\boldsymbol{\beta}$ and $\sigma^{2}$:[7]

$p(\mathbf{y} \mid m) = \frac{p(\boldsymbol{\beta} \mid \sigma^{2})\, p(\sigma^{2})\, p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^{2})}{p(\boldsymbol{\beta}, \sigma^{2} \mid \mathbf{y}, \mathbf{X})}.$

Note that this equation follows from a re-arrangement of Bayes' theorem. Inserting the formulas for the prior, the likelihood, and the posterior and simplifying the resulting expression leads to the analytic expression given above.

Other cases

In general, it may be impossible or impractical to derive the posterior distribution analytically. However, it is possible to approximate the posterior by an approximate Bayesian inference method such as Monte Carlo sampling,[8] INLA or variational Bayes.

The special case $\boldsymbol{\mu}_0 = \mathbf{0},\ \boldsymbol{\Lambda}_0 = c\,\mathbf{I}$ is called ridge regression.

A similar analysis can be performed for the general case of the multivariate regression and part of this provides for Bayesian estimation of covariance matrices: see Bayesian multivariate linear regression.

from Grokipedia
Bayesian linear regression is a probabilistic extension of classical linear regression that applies Bayesian inference to model the relationship between a dependent variable and one or more independent variables, treating the regression parameters as random variables with specified prior distributions that are updated with observed data to yield posterior distributions for inference. This approach provides a full posterior distribution over the parameters rather than point estimates, enabling quantification of uncertainty in predictions and coefficients. In the standard formulation, the model posits that the observed response vector $\mathbf{y}$ follows $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$, $\mathbf{X}$ is the design matrix of predictors, $\boldsymbol{\beta}$ is the vector of coefficients, and $\sigma^2$ is the noise variance. The likelihood is Gaussian, $p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = (2\pi \sigma^2)^{-n/2} \exp\left[ -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right]$, with $n$ observations. Common conjugate priors include a normal (Gaussian) prior for $\boldsymbol{\beta}$ given $\sigma^2$, such as $\boldsymbol{\beta} \mid \sigma^2 \sim \mathcal{N}(\boldsymbol{\mu}_0, \sigma^2 \mathbf{V}_0)$, and an inverse-gamma prior for $\sigma^2$, ensuring the posterior is also normal-inverse-gamma for analytical tractability. The resulting posterior distribution for the parameters is $p(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) \propto p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2)\, p(\boldsymbol{\beta}, \sigma^2)$, which updates the prior beliefs with the data.

Key advantages of Bayesian linear regression over frequentist methods include the incorporation of prior knowledge to regularize estimates and prevent overfitting, particularly in low-data regimes, as the posterior mean for $\boldsymbol{\beta}$ is a shrinkage estimator weighting the prior mean and the ordinary least squares estimate. It also yields a posterior predictive distribution $p(\mathbf{y}^* \mid \mathbf{y}, \mathbf{X}^*, \mathbf{X})$ that marginalizes over the posterior, providing not only point predictions but also credible intervals that account for both parameter and noise uncertainty; in the conjugate case, this is a multivariate Student-t distribution with mean given by the posterior mean of the linear predictor $\boldsymbol{\mu}^\top \psi(\mathbf{x}^*)$ (where $\psi$ denotes basis functions if used) and scale incorporating uncertainties in $\boldsymbol{\beta}$ and $\sigma^2$. Inference can be exact with conjugate priors or approximated via Markov chain Monte Carlo for non-conjugate cases, such as shrinkage priors like the Bayesian lasso for sparse high-dimensional settings. This framework extends naturally to generalized linear models and hierarchical structures, making it versatile for complex data analyses.

Introduction and Fundamentals

Overview

Bayesian linear regression is a probabilistic approach to modeling the linear relationship between a response variable and one or more predictor variables, where the regression coefficients and noise variance are treated as random variables assigned prior distributions. The posterior distribution for these parameters is derived by updating the prior with the data likelihood using Bayes' theorem, yielding a full posterior distribution that encapsulates uncertainty in the model. This framework extends classical linear regression by incorporating prior knowledge and providing a coherent mechanism for reasoning under uncertainty.

The origins of Bayesian linear regression can be traced to Carl Friedrich Gauss's 1809 work on least-squares estimation, which he justified using principles of inverse probability akin to Bayesian reasoning, laying early groundwork for treating parameters probabilistically. In the twentieth century, the approach gained prominence through contributions like those of Dennis V. Lindley and Adrian F. M. Smith, who in 1972 developed Bayes estimates for the linear model, emphasizing exchangeability and prior elicitation to enhance estimation. These developments built on foundational ideas from Thomas Bayes and Pierre-Simon Laplace, adapting them to regression contexts for more robust handling of data scarcity and expert knowledge.

Key motivations for adopting Bayesian linear regression over frequentist methods stem from its capacity to address limitations in ordinary least squares (OLS), which yields single point estimates without priors or inherent uncertainty measures. Priors enable regularization to prevent overfitting, particularly in high-dimensional settings, while the posterior supports credible intervals for interval estimation and the marginal likelihood facilitates objective model selection and comparison. This results in more flexible and interpretable analyses, especially when prior information from domain expertise is available.

Model Formulation

Bayesian linear regression begins with the standard linear model, which posits a linear relationship between the response variable and predictors. Specifically, for $n$ observations, the model is given by
$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon},$
where $\mathbf{y}$ is the $n \times 1$ vector of responses, $\mathbf{X}$ is the $n \times p$ design matrix of predictors (including a column of ones for the intercept), $\boldsymbol{\beta}$ is the $p \times 1$ vector of regression coefficients, and $\boldsymbol{\epsilon}$ is the $n \times 1$ vector of errors with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)$. This formulation assumes that the errors are independent and identically distributed as normal with mean zero and constant variance $\sigma^2$.

The likelihood function for the observed data $\mathbf{y}$ given the parameters $\boldsymbol{\beta}$ and $\sigma^2$ follows directly from the normality assumption and is expressed as
$p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2) = (2\pi \sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right).$
This multivariate normal likelihood encapsulates the contribution of the data to parameter inference in the Bayesian framework.

The model relies on several key assumptions: linearity in the parameters, independence of the errors, homoscedasticity (constant variance $\sigma^2$), and normality of the errors. Violations of these assumptions, such as heteroscedasticity or non-normality, can affect inference; however, Bayesian approaches often demonstrate robustness to such departures through appropriate prior specifications or alternative error distributions like the t-distribution.

For comparison, the frequentist ordinary least squares (OLS) estimator provides a point estimate for $\boldsymbol{\beta}$ by minimizing the residual sum of squares, yielding $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$, with asymptotic variance $\sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$. This estimator coincides with the posterior mode under noninformative priors in the Bayesian setting but lacks the full uncertainty quantification provided by the posterior distribution.
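
For concreteness, a small sketch of evaluating this likelihood on the log scale follows; the function name and variables are illustrative, not part of a standard API.

python

import numpy as np

def gaussian_log_likelihood(y, X, beta, sigma2):
    """Log of p(y | beta, sigma^2) for the normal linear model written above."""
    n = len(y)
    resid = y - X @ beta
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - 0.5 * (resid @ resid) / sigma2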

Bayesian Inference Principles

Bayesian inference in linear regression applies Bayes' theorem to update beliefs about the model parameters based on observed data, yielding a posterior distribution that incorporates both prior knowledge and the evidence from the data. For the regression coefficients $\boldsymbol{\beta}$ and error variance $\sigma^2$, the theorem states that the joint posterior is proportional to the product of the likelihood and the prior:
$p(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) \propto p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2, \mathbf{X}) \, p(\boldsymbol{\beta}, \sigma^2),$
where $\mathbf{y}$ denotes the response vector, $\mathbf{X}$ the design matrix, $p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2, \mathbf{X})$ the Gaussian likelihood assuming normally distributed errors, and $p(\boldsymbol{\beta}, \sigma^2)$ the prior on the parameters. This formulation allows the posterior to reflect the full uncertainty in $\boldsymbol{\beta}$ and $\sigma^2$, rather than relying on point estimates.

The joint prior $p(\boldsymbol{\beta}, \sigma^2)$ is typically specified in a separable manner as $p(\boldsymbol{\beta} \mid \sigma^2) \, p(\sigma^2)$, where the conditional prior on $\boldsymbol{\beta}$ given $\sigma^2$ often follows a multivariate normal distribution scaled by the variance, and $p(\sigma^2)$ is a marginal prior such as an inverse-gamma or scaled-inverse-chi-squared distribution. This structure facilitates analytical tractability in conjugate cases and enables the incorporation of domain-specific information, such as regularization through informative priors on $\boldsymbol{\beta}$.

A key output of Bayesian inference is the posterior predictive distribution, which generates probabilistic forecasts for new data $\mathbf{y}^*$ at covariate values $\mathbf{X}^*$ by marginalizing over the posterior:
$p(\mathbf{y}^* \mid \mathbf{y}, \mathbf{X}, \mathbf{X}^*) = \int p(\mathbf{y}^* \mid \boldsymbol{\beta}, \sigma^2, \mathbf{X}^*) \, p(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) \, d\boldsymbol{\beta} \, d\sigma^2.$
This integral quantifies prediction uncertainty directly from the model. The advantages of this full distributional approach include the ability to derive credible intervals for parameters and predictions, capturing epistemic and aleatoric uncertainty, as well as shrinkage effects from priors that regularize estimates toward plausible values, which is particularly beneficial in high dimensions or small samples. In contrast, the maximum a posteriori (MAP) estimate optimizes the posterior for a single point, forgoing the richer uncertainty quantification provided by the full distribution.
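
A minimal sketch of this relationship on the log scale follows, assuming a normal prior on $\boldsymbol{\beta}$ scaled by $\sigma^2$ and an inverse-gamma prior on $\sigma^2$; the helper name and hyperparameters are illustrative choices.

python

import numpy as np
from scipy.stats import invgamma, multivariate_normal

def log_unnormalized_posterior(beta, sigma2, y, X, mu0, V0, a0, b0):
    """log p(beta, sigma^2 | y, X) up to an additive constant:
    Gaussian log-likelihood plus log N(mu0, sigma^2 V0) and log IG(a0, b0) priors."""
    n = len(y)
    resid = y - X @ beta
    loglik = -0.5 * n * np.log(2.0 * np.pi * sigma2) - 0.5 * (resid @ resid) / sigma2
    logprior_beta = multivariate_normal.logpdf(beta, mean=mu0, cov=sigma2 * V0)
    logprior_sigma2 = invgamma.logpdf(sigma2, a=a0, scale=b0)
    return loglik + logprior_beta + logprior_sigma2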

Prior Distributions

Conjugate Priors

In Bayesian linear regression, the conjugate prior for the model parameters $\boldsymbol{\beta}$ and the noise variance $\sigma^2$ is the normal-inverse-gamma (NIG) distribution, which matches the functional form of the normal likelihood and thereby enables analytical posterior updates. The marginal prior on $\sigma^2$ is specified as an inverse-gamma distribution:
$\sigma^2 \sim \text{Inverse-Gamma}\left(\frac{\nu_0}{2}, \frac{s_0^2 \nu_0}{2}\right),$
where $\nu_0 > 0$ and $s_0^2 > 0$ are hyperparameters controlling the prior shape and scale, respectively. Conditional on $\sigma^2$, the prior on the regression coefficients $\boldsymbol{\beta}$ is multivariate normal:
$\boldsymbol{\beta} \mid \sigma^2 \sim \mathcal{N}\left(\boldsymbol{\mu}_0, \sigma^2 (\lambda_0 \mathbf{X}_0^\top \mathbf{X}_0)^{-1}\right),$
with hyperparameters $\boldsymbol{\mu}_0$ (a $p \times 1$ vector, where $p$ is the number of predictors), $\lambda_0 > 0$, and $\mathbf{X}_0$ (a $\lambda_0 \times p$ design matrix).

The hyperparameters admit natural interpretations in terms of fictitious prior data: $\nu_0$ represents the prior degrees of freedom (analogous to the effective number of prior observations for the variance estimate), $s_0^2$ is the prior scale or expected variance based on that data, $\boldsymbol{\mu}_0$ is the prior mean estimate for $\boldsymbol{\beta}$, $\lambda_0$ acts as the prior sample size or precision weight, and $\mathbf{X}_0$ serves as the design matrix for the imaginary prior dataset that generates the prior covariance structure. This setup allows the prior to encode beliefs derived from previous experiments or domain knowledge, with larger $\lambda_0$ and $\nu_0$ implying stronger prior influence.

The joint prior density $p(\boldsymbol{\beta}, \sigma^2)$ is the product of the conditional normal and marginal inverse-gamma densities:
$$\begin{aligned} p(\boldsymbol{\beta}, \sigma^2) &= p(\boldsymbol{\beta} \mid \sigma^2) \, p(\sigma^2) \\ &= \frac{|\lambda_0 \mathbf{X}_0^\top \mathbf{X}_0|^{1/2}}{(2\pi)^{p/2} (\sigma^2)^{p/2}} \exp\left[ -\frac{1}{2\sigma^2} (\boldsymbol{\beta} - \boldsymbol{\mu}_0)^\top (\lambda_0 \mathbf{X}_0^\top \mathbf{X}_0) (\boldsymbol{\beta} - \boldsymbol{\mu}_0) \right] \\ &\quad \times \frac{ \left( \tfrac{s_0^2 \nu_0}{2} \right)^{\nu_0 / 2} }{ \Gamma(\nu_0 / 2) } (\sigma^2)^{-(\nu_0 / 2 + 1)} \exp\left[ -\frac{s_0^2 \nu_0}{2 \sigma^2} \right]. \end{aligned}$$
This form ensures conjugacy with the Gaussian likelihood, yielding a posterior of the same NIG family.

Special cases of the NIG prior include non-informative limits, such as the improper prior $p(\boldsymbol{\beta}, \sigma^2) \propto 1/\sigma^2$, obtained by taking $\nu_0 \to 0$ and $\lambda_0 \to 0$ (that is, letting the prior precision on $\boldsymbol{\beta}$ vanish so that its covariance diverges), which expresses vague prior knowledge without favoring specific values. Additionally, fixing $\sigma^2$ and using a normal prior on $\boldsymbol{\beta}$ with covariance $\sigma^2 (\lambda_0 \mathbf{I})^{-1}$ (a special case where $\mathbf{X}_0^\top \mathbf{X}_0 = \mathbf{I}$) yields a posterior mean equivalent to the ridge regression estimator with penalty parameter $\lambda_0$, providing a Bayesian interpretation of regularization.
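
One way to explore what such a prior implies is simply to sample from it. The sketch below assumes the prior precision $\lambda_0 \mathbf{X}_0^\top \mathbf{X}_0$ is supplied as a single matrix argument K0; the function name and calling convention are illustrative rather than a standard API.

python

import numpy as np
from scipy.stats import invgamma

def sample_nig_prior(mu0, K0, nu0, s0_sq, size=1000, seed=1):
    """Draw (beta, sigma^2) pairs from the normal-inverse-gamma prior above;
    K0 stands in for the prior precision matrix lambda_0 * X0' X0."""
    rng = np.random.default_rng(seed)
    sigma2 = invgamma.rvs(a=nu0 / 2.0, scale=nu0 * s0_sq / 2.0, size=size, random_state=rng)
    K0_inv = np.linalg.inv(K0)
    betas = np.array([rng.multivariate_normal(mu0, s2 * K0_inv) for s2 in sigma2])
    return betas, sigma2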

Non-Conjugate Priors

Non-conjugate priors in Bayesian linear regression refer to prior distributions on model parameters that do not yield analytically tractable posterior distributions when combined with the likelihood, necessitating computational techniques for posterior inference. Unlike conjugate priors, which facilitate exact solutions, non-conjugate priors offer greater flexibility for incorporating complex structures such as sparsity or domain-specific knowledge, making them suitable for modern high-dimensional problems.

Prominent examples include sparsity-inducing priors like the horseshoe prior, which places a conditional normal prior on each regression coefficient, $\beta_j \mid \sigma^2, \tau, \lambda_j \sim \mathcal{N}(0, \sigma^2 \tau^2 \lambda_j^2)$, with global shrinkage $\tau$ and local shrinkage parameters $\lambda_j \sim \mathrm{C}^+(0,1)$ (half-Cauchy), promoting many coefficients to near zero while allowing a few to remain large. The Laplace prior, equivalent to a double-exponential distribution, induces L1 regularization akin to the lasso and is implemented hierarchically as $\beta_j \mid \sigma^2, \tau_j^2 \sim \mathcal{N}(0, \sigma^2 \tau_j^2)$ with $\tau_j^2 \sim \text{Exp}(\lambda^2/2)$, enabling automatic variable selection through shrinkage. Empirical Bayes approaches, which estimate hyperparameters from the data itself (e.g., by maximizing the marginal likelihood), can be applied to various priors, including conjugate ones like Zellner's g-prior with an empirically chosen scaling factor $g$, to balance shrinkage in high dimensions.

These priors excel in high-dimensional settings where the number of predictors exceeds observations, facilitating effective variable selection and improved prediction by concentrating posterior mass on sparse models. They also allow incorporation of structured prior knowledge through hierarchical structures, such as multilevel priors that model group-level variability in coefficients, enhancing interpretability in contexts like genomics for gene selection or finance for factor modeling under sparsity constraints. Preference for non-conjugate priors over conjugate ones arises in such scenarios, where analytical simplicity is traded for robustness to irrelevant variables and better out-of-sample performance. A key challenge is the resulting posterior intractability, which demands approximation via methods like Markov chain Monte Carlo, though this increases computational cost compared to conjugate cases. Hyperparameters in these priors, such as the global shrinkage $\tau$ in the horseshoe or the rate $\lambda$ in the Laplace, are often elicited using empirical Bayes approaches that maximize the marginal likelihood to estimate values from data, or via cross-validation to optimize predictive performance.
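
As an illustration, a horseshoe regression along these lines could be written in PyMC roughly as follows; X and y are assumed to be existing NumPy arrays, and the variable names and sampler settings are illustrative rather than a reference implementation.

python

import pymc as pm

# X: (n, p) array of predictors, y: (n,) response -- assumed to exist already
with pm.Model() as horseshoe_model:
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    tau = pm.HalfCauchy("tau", beta=1.0)                      # global shrinkage
    lam = pm.HalfCauchy("lam", beta=1.0, shape=X.shape[1])    # local shrinkage, one per coefficient
    beta = pm.Normal("beta", mu=0.0, sigma=sigma * tau * lam, shape=X.shape[1])
    mu = pm.math.dot(X, beta)
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000, target_accept=0.95)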

Posterior Inference

Analytical Posterior Derivation

In Bayesian linear regression, the analytical posterior distribution can be derived exactly when using the conjugate normal-inverse-gamma (NIG) prior for the model parameters $\boldsymbol{\beta}$ and $\sigma^2$. The model assumes $\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)$. The NIG prior is specified as $\boldsymbol{\beta} \mid \sigma^2 \sim \mathcal{N}(\boldsymbol{\mu}_0, \sigma^2 (\lambda_0 \mathbf{X}_0^\top \mathbf{X}_0)^{-1})$ and $\sigma^2 \sim \text{Inverse-Gamma}(\nu_0/2, s_0^2 \nu_0 / 2)$, which encodes prior information as if derived from $\lambda_0$ pseudo-observations with design matrix $\mathbf{X}_0$. The likelihood is $\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2, \mathbf{X} \sim \mathcal{N}(\mathbf{X} \boldsymbol{\beta}, \sigma^2 \mathbf{I}_n)$.

The joint posterior $p(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X})$ is proportional to the product of the prior and likelihood. Due to conjugacy, this posterior remains in the NIG family, with updated hyperparameters derived by completing the square in the exponent of the normal components and aggregating scale terms for the inverse-gamma part. The conditional posterior for the coefficients is $\boldsymbol{\beta} \mid \sigma^2, \mathbf{y}, \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}_n, \sigma^2 \mathbf{B}_n^{-1})$, where $\mathbf{B}_n = \lambda_0 \mathbf{X}_0^\top \mathbf{X}_0 + \mathbf{X}^\top \mathbf{X}$, and the marginal posterior for the variance is $\sigma^2 \mid \mathbf{y}, \mathbf{X} \sim \text{Inverse-Gamma}(\nu_n/2, s_n^2 \nu_n / 2)$. The posterior hyperparameters are updated as follows:
$\mathbf{B}_n = \lambda_0 \mathbf{X}_0^\top \mathbf{X}_0 + \mathbf{X}^\top \mathbf{X}, \quad \boldsymbol{\mu}_n = \mathbf{B}_n^{-1} (\lambda_0 \mathbf{X}_0^\top \mathbf{X}_0 \boldsymbol{\mu}_0 + \mathbf{X}^\top \mathbf{X} \hat{\boldsymbol{\beta}}),$
$\nu_n = \nu_0 + n, \quad s_n^2 = \frac{\nu_0 s_0^2 + (\mathbf{y} - \mathbf{X} \boldsymbol{\mu}_n)^\top (\mathbf{y} - \mathbf{X} \boldsymbol{\mu}_n) + \lambda_0 (\boldsymbol{\mu}_0 - \boldsymbol{\mu}_n)^\top \mathbf{X}_0^\top \mathbf{X}_0 (\boldsymbol{\mu}_0 - \boldsymbol{\mu}_n)}{\nu_n}.$
These updates reflect the accumulation of prior pseudo-data with the observed data: $\nu_n$ increases with sample size $n$, the precision matrix $\mathbf{B}_n$ augments the prior information matrix with the data information matrix, $\boldsymbol{\mu}_n$ weights the prior mean and the ordinary least squares estimator $\hat{\boldsymbol{\beta}}$ by their respective precisions, and $s_n^2$ adjusts the scale by the residual sum of squares about the posterior mean plus a prior-to-posterior discrepancy term.

Integrating out $\sigma^2$, the marginal posterior for $\boldsymbol{\beta}$ is a multivariate t-distribution: $\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X} \sim \mathcal{T}_{\nu_n} (\boldsymbol{\mu}_n, s_n^2 \mathbf{B}_n^{-1})$. This arises because the conditional normal for $\boldsymbol{\beta}$ given $\sigma^2$ combined with the inverse-gamma marginal for $\sigma^2$ yields a t-distribution, providing heavier tails than the normal to account for uncertainty in $\sigma^2$.
Key properties of this posterior include that the posterior mean $\boldsymbol{\mu}_n$ is a precision-weighted average of the prior mean $\boldsymbol{\mu}_0$ (weighted by the prior precision $\lambda_0 \mathbf{X}_0^\top \mathbf{X}_0$) and the data-based estimate $\hat{\boldsymbol{\beta}}$ (weighted by $\mathbf{X}^\top \mathbf{X}$), ensuring shrinkage toward the prior. The posterior variance shrinks with increasing $n$, as $\mathbf{B}_n$ grows, concentrating the distribution around $\boldsymbol{\mu}_n$ and reflecting reduced uncertainty from more observations.
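
A sketch of these updates, together with exact joint draws from the resulting posterior, is given below; the argument K0 stands in for the prior precision $\lambda_0 \mathbf{X}_0^\top \mathbf{X}_0$, and the function name is illustrative.

python

import numpy as np
from scipy.stats import invgamma

def nig_posterior_draws(X, y, mu0, K0, nu0, s0_sq, draws=2000, seed=0):
    """Compute (mu_n, B_n, nu_n, s_n^2) as above and return exact joint posterior draws.
    K0 stands in for the prior precision lambda_0 * X0' X0."""
    rng = np.random.default_rng(seed)
    n = len(y)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)              # OLS estimate
    B_n = K0 + X.T @ X                                        # posterior precision (up to sigma^2)
    mu_n = np.linalg.solve(B_n, K0 @ mu0 + X.T @ X @ beta_hat)
    nu_n = nu0 + n
    resid = y - X @ mu_n
    s_n_sq = (nu0 * s0_sq + resid @ resid + (mu0 - mu_n) @ K0 @ (mu0 - mu_n)) / nu_n
    # sigma^2 | y ~ Inv-Gamma(nu_n/2, nu_n s_n^2 / 2); beta | sigma^2, y ~ N(mu_n, sigma^2 B_n^{-1})
    sigma2 = invgamma.rvs(a=nu_n / 2.0, scale=nu_n * s_n_sq / 2.0, size=draws, random_state=rng)
    B_n_inv = np.linalg.inv(B_n)
    betas = np.array([rng.multivariate_normal(mu_n, s2 * B_n_inv) for s2 in sigma2])
    return betas, sigma2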

Computational Methods

When analytical solutions for the posterior distribution are intractable, such as in cases involving non-conjugate priors or high-dimensional settings, computational methods become essential for approximating Bayesian inference in linear regression models. These techniques enable the generation of samples or approximations from the posterior distribution $p(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X})$, facilitating uncertainty quantification and model fitting.

Markov chain Monte Carlo (MCMC) methods are a cornerstone of posterior sampling in Bayesian linear regression, particularly when exact derivations are unavailable. For conditionally conjugate priors, Gibbs sampling offers an efficient approach by iteratively drawing from the full conditional posterior distributions of the parameters, such as alternating between the regression coefficients $\boldsymbol{\beta}$ given $\sigma^2$ and the variance $\sigma^2$ given $\boldsymbol{\beta}$. This method leverages the conditional conjugacy to produce a Markov chain that converges to the joint posterior. For more general, non-conjugate cases, the Metropolis-Hastings algorithm is employed, where proposals for parameter updates are accepted or rejected based on the posterior ratio to ensure the chain targets the desired distribution. Implementations of these MCMC techniques are available in libraries like Stan, which uses Hamiltonian Monte Carlo for efficient sampling, and PyMC, which supports Metropolis-Hastings and gradient-based samplers, optionally through NumPyro backends.

Variational inference provides a faster, optimization-based alternative to MCMC by approximating the posterior with a simpler distribution that minimizes the Kullback-Leibler (KL) divergence to the true posterior. A common mean-field approximation assumes a factorized form, such as $q(\boldsymbol{\beta}, \sigma^2) = q(\boldsymbol{\beta})\, q(\sigma^2)$, where each factor is typically Gaussian, and parameters are updated via coordinate ascent variational Bayes to iteratively optimize the evidence lower bound (ELBO). This approach scales well to large datasets but introduces bias by enforcing independence among parameters.

The Laplace approximation offers a deterministic method for posterior approximation by performing a second-order Taylor expansion of the log-posterior around its mode, the maximum a posteriori (MAP) estimate, yielding a Gaussian distribution centered at the mode with covariance given by the inverse Hessian. This quadratic approximation is particularly useful for moderate-dimensional problems where the posterior is roughly symmetric and unimodal, providing quick estimates of posterior moments without sampling.

Assessing convergence and reliability of these approximations is critical, with diagnostics including trace plots to visualize chain mixing and stationarity, as well as the Gelman-Rubin statistic, which compares variances within and between multiple chains to detect non-convergence (values close to 1 indicate convergence). Computational costs rise with sample size $n$ or dimension $p$: MCMC may require thousands of iterations for high-dimensional models, while variational methods and Laplace approximations offer $O(n p^2)$ or better scaling, but at the expense of accuracy in multimodal posteriors. Modern software tools facilitate these methods, including JAGS for Gibbs sampling via conditional specifications, Stan for advanced MCMC with the No-U-Turn Sampler, PyMC for flexible Python-based modeling with JAX-based acceleration in recent releases, and Pyro for scalable inference in deep probabilistic models.
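
For the semi-conjugate case with an independent normal prior on $\boldsymbol{\beta}$ and an inverse-gamma prior on $\sigma^2$, a Gibbs sampler of the kind described above might look roughly like this; it is a sketch under those prior assumptions, not a library implementation.

python

import numpy as np
from scipy.stats import invgamma

def gibbs_linear_regression(X, y, mu0, V0_inv, a0, b0, iters=5000, seed=0):
    """Gibbs sampler alternating the full conditionals
    beta | sigma^2, y ~ Normal  and  sigma^2 | beta, y ~ Inverse-Gamma,
    under an independent N(mu0, V0) prior on beta and IG(a0, b0) prior on sigma^2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    beta, sigma2 = np.zeros(p), 1.0
    beta_draws, sigma2_draws = [], []
    for _ in range(iters):
        # Draw beta from its full conditional
        precision = V0_inv + XtX / sigma2
        cov = np.linalg.inv(precision)
        mean = cov @ (V0_inv @ mu0 + Xty / sigma2)
        beta = rng.multivariate_normal(mean, cov)
        # Draw sigma^2 from its full conditional
        resid = y - X @ beta
        sigma2 = invgamma.rvs(a=a0 + n / 2.0, scale=b0 + 0.5 * (resid @ resid), random_state=rng)
        beta_draws.append(beta)
        sigma2_draws.append(sigma2)
    return np.array(beta_draws), np.array(sigma2_draws)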

Model Evaluation and Predictions

Marginal Likelihood

The marginal likelihood, also known as the model evidence, in Bayesian linear regression is defined as the probability of the observed data $\mathbf{y}$ given the design matrix $\mathbf{X}$ and model $\mathcal{M}$, obtained by integrating out the parameters $\boldsymbol{\beta}$ and $\sigma^2$:
$p(\mathbf{y} \mid \mathbf{X}, \mathcal{M}) = \int p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2, \mathbf{X}, \mathcal{M}) \, p(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{X}, \mathcal{M}) \, d\boldsymbol{\beta} \, d\sigma^2.$
This integral represents the normalizing constant of the posterior distribution and quantifies the plausibility of the data under the model prior to observing the parameters.

Under conjugate priors such as the normal-inverse-gamma (NIG) distribution for $(\boldsymbol{\beta}, \sigma^2)$, the marginal likelihood admits a closed-form expression as the probability density function of a multivariate Student-t distribution for $\mathbf{y}$, with location parameter $\mathbf{X} \boldsymbol{\mu}_0$, degrees of freedom $\nu_0$, and scale matrix $s_0^2 (\mathbf{I}_n + \mathbf{X} \mathbf{V}_0 \mathbf{X}^\top)$, where $\boldsymbol{\mu}_0$, $\mathbf{V}_0$, $s_0^2$, and $\nu_0$ are the prior mean, covariance scaling matrix, scale, and degrees of freedom parameters of the NIG prior, respectively. This closed form facilitates exact computation when the prior is conjugate, enabling analytical model assessment without simulation. It is derived by completing the square in the joint density of the parameters and data, then marginalizing over $\boldsymbol{\beta}$ and $\sigma^2$.

The marginal likelihood serves as the basis for Bayes factors, which are ratios of evidences between competing models and support Bayesian model selection. For instance, the Bayes factor comparing two nested linear regression models $\mathcal{M}_1$ and $\mathcal{M}_0$ is $BF_{10} = p(\mathbf{y} \mid \mathbf{X}, \mathcal{M}_1) / p(\mathbf{y} \mid \mathbf{X}, \mathcal{M}_0)$, where values greater than 1 favor $\mathcal{M}_1$ and provide a measure of relative evidence strength. This approach is particularly useful in variable selection for linear regression, as it penalizes model complexity through prior integration.

When closed-form expressions are unavailable or priors are non-conjugate, the marginal likelihood can be approximated using posterior samples from MCMC methods. The harmonic mean estimator, computed as the reciprocal of the average of the reciprocals of the likelihood evaluated at posterior draws, provides a simple Monte Carlo-based estimate but suffers from high variance. Chib's method offers a more stable alternative by equating the marginal likelihood to the ratio of prior and posterior ordinates at a high-density point, leveraging Gibbs sampler output for accuracy. In high dimensions, such as large-scale regressions with many predictors, these approximations face challenges due to poor MCMC mixing and the curse of dimensionality, often requiring advanced techniques like bridge sampling for reliable estimates.
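
Under the conjugate normal-inverse-gamma prior, the log evidence can be evaluated in closed form. The following sketch uses a $(\boldsymbol{\Lambda}_0, a_0, b_0)$ parametrization of that prior and is illustrative rather than a reference implementation.

python

import numpy as np
from scipy.special import gammaln

def log_evidence_nig(X, y, mu0, Lambda0, a0, b0):
    """Closed-form log marginal likelihood under a normal-inverse-gamma prior
    with prior precision Lambda0 on beta and IG(a0, b0) on sigma^2."""
    n = len(y)
    Lambda_n = Lambda0 + X.T @ X
    mu_n = np.linalg.solve(Lambda_n, Lambda0 @ mu0 + X.T @ y)
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * (y @ y + mu0 @ Lambda0 @ mu0 - mu_n @ Lambda_n @ mu_n)
    _, logdet0 = np.linalg.slogdet(Lambda0)
    _, logdet_n = np.linalg.slogdet(Lambda_n)
    return (-0.5 * n * np.log(2.0 * np.pi)
            + 0.5 * (logdet0 - logdet_n)
            + a0 * np.log(b0) - a_n * np.log(b_n)
            + gammaln(a_n) - gammaln(a0))

# A Bayes factor between two candidate designs X1 and X0 would then be
# np.exp(log_evidence_nig(X1, y, ...) - log_evidence_nig(X0, y, ...)).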

Predictive Distributions

In Bayesian linear regression, the posterior predictive distribution provides a full probabilistic forecast for new observations $\mathbf{y}^*$ given inputs $\mathbf{x}^*$, the observed data $\mathbf{y}$, and the design matrix $\mathbf{X}$. It is obtained by integrating the likelihood over the posterior distribution of the parameters:
$p(\mathbf{y}^* \mid \mathbf{y}, \mathbf{X}, \mathbf{x}^*) = \int \mathcal{N}(\mathbf{y}^* \mid \mathbf{x}^{* \top} \boldsymbol{\beta}, \sigma^2) \, p(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) \, d\boldsymbol{\beta} \, d\sigma^2.$
This distribution quantifies both the aleatoric uncertainty from the noise process and the epistemic uncertainty from the parameters, enabling predictions that reflect all sources of variability in the model.

Under conjugate priors, such as the normal-inverse-gamma prior on $\boldsymbol{\beta}$ and $\sigma^2$, the posterior predictive distribution admits a closed-form expression as a multivariate Student-t distribution. Specifically, for a new observation,
$\mathbf{y}^* \mid \mathbf{y}, \mathbf{X}, \mathbf{x}^* \sim t_{\nu_n} \left( \mathbf{x}^{* \top} \boldsymbol{\mu}_n, \, s_n^2 \left(1 + \mathbf{x}^{* \top} (\lambda_n \mathbf{X}_n^\top \mathbf{X}_n)^{-1} \mathbf{x}^* \right) \right),$
where $\nu_n$, $\boldsymbol{\mu}_n$, $s_n^2$, and $\lambda_n$ are updated posterior quantities derived from the data and prior (e.g., degrees of freedom $\nu_n = \nu_0 + n$, posterior mean $\boldsymbol{\mu}_n$, scale $s_n^2$, and precision scalar $\lambda_n$). This form arises because the marginal posterior for $\boldsymbol{\beta}$ is Student-t when $\sigma^2$ has an inverse-gamma prior, and convolving with the Gaussian likelihood yields another Student-t. The heavier tails of the Student-t compared to a Gaussian capture the combined uncertainties, making it suitable for small samples or when the variance is uncertain.

Credible intervals derived from the posterior predictive distribution are typically wider than corresponding frequentist prediction intervals for the same data, as they incorporate uncertainty in the estimates of $\boldsymbol{\beta}$ and $\sigma^2$ in addition to sampling variability. For instance, in linear models with partially observed outcomes, analysis shows that Bayesian credible intervals expand to account for this full posterior variability, providing more conservative forecasts that avoid understating uncertainty. This difference highlights the Bayesian approach's explicit quantification of indeterminacy, which frequentist intervals treat as fixed.

When conjugate priors do not apply, or for non-conjugate cases, the posterior predictive can be approximated via Monte Carlo methods by drawing samples $\{\boldsymbol{\beta}^{(s)}, \sigma^{2(s)}\}_{s=1}^S$ from the posterior $p(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X})$ and then simulating $\mathbf{y}^{*(s)} \sim \mathcal{N}(\mathbf{x}^{* \top} \boldsymbol{\beta}^{(s)}, \sigma^{2(s)})$ for each draw. The empirical distribution of $\{\mathbf{y}^{*(s)}\}$ approximates $p(\mathbf{y}^* \mid \mathbf{y}, \mathbf{X}, \mathbf{x}^*)$, from which summaries like means, variances, or quantiles can be computed. This simulation-based approach is flexible and scales to complex models using MCMC output. To evaluate the quality of these predictives, metrics such as calibration (assessing reliability, e.g., via probability integral transform histograms) and sharpness (measuring concentration, e.g., average interval width) are used, often in conjunction with proper scoring rules to ensure the approximations are well-calibrated to observed outcomes.
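
A minimal sketch of this Monte Carlo approximation follows, assuming posterior draws of $(\boldsymbol{\beta}, \sigma^2)$ are already available (for example from the samplers sketched earlier); names are illustrative.

python

import numpy as np

def posterior_predictive_draws(x_star, betas, sigma2s, seed=0):
    """Monte Carlo draws of y* for a new input x_star, given posterior samples
    of beta (draws x p array) and sigma^2 (length-draws array)."""
    rng = np.random.default_rng(seed)
    mu_star = betas @ x_star                          # one linear predictor per posterior draw
    return rng.normal(mu_star, np.sqrt(sigma2s))      # add observation noise draw by draw

# Example summary: predictive mean and a central 95% credible interval
# y_star = posterior_predictive_draws(x_star, betas, sigma2s)
# y_star.mean(), np.quantile(y_star, [0.025, 0.975])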

Extensions and Applications

Advanced Variants

Bayesian linear regression can be extended to hierarchical models, where priors are placed on hyperparameters to account for variability across groups or levels, such as in multilevel regression with random effects. In these models, regression coefficients are treated as random variables drawn from a higher-level distribution, enabling partial pooling of information across subgroups to improve estimates in settings with clustered data. This approach, foundational in works like Gelman and Hill (2007), facilitates modeling of varying intercepts or slopes while incorporating uncertainty in the group-level parameters.

In high-dimensional settings where the number of predictors exceeds the sample size, sparsity-inducing priors address variable selection challenges in Bayesian linear regression. The Bayesian lasso, introduced by Park and Casella (2008), employs a Laplace prior on coefficients, equivalent to a scale mixture of normals, which shrinks small coefficients toward zero while providing credible intervals for selection. The horseshoe prior, developed by Carvalho et al. (2009), uses a global-local shrinkage mechanism with half-Cauchy distributions on local scales, achieving aggressive shrinkage for irrelevant variables and minimal shrinkage for important ones, particularly effective for sparse signals. Spike-and-slab priors, as formalized by Ishwaran and Rao (2005), combine a point mass (spike) at zero with a broader slab distribution, enabling exact variable selection through posterior inclusion probabilities, though computationally intensive without approximations.

Robust variants of Bayesian linear regression replace Gaussian errors with heavier-tailed distributions to mitigate the influence of outliers. The Student-t error model, with a degrees-of-freedom parameter controlling tail heaviness, provides downweighting of anomalous observations while maintaining conjugacy in certain hierarchical setups, as analyzed in Bayesian frameworks by Geweke (1993). This extension connects to broader generalized linear models but focuses on linear structures with robust likelihoods, enhancing stability in contaminated data.

Time-series extensions incorporate temporal dependence into Bayesian linear regression, such as autoregressive (AR) models where the response depends on lagged values. Bayesian AR(p) models treat AR coefficients as random with priors ensuring stationarity, estimated via MCMC, as in the foundational approach by Chib and Greenberg (1994). In the nonparametric limit, Gaussian processes emerge as infinite-dimensional Bayesian linear regression with a kernel-induced covariance, viewing the function as a draw from a GP prior updated by data, equivalent to kernelized Bayesian regression as detailed by Rasmussen and Williams (2006).

Recent advancements post-2020 emphasize scalability for large datasets, particularly through variational Bayes (VB) approximations in high-dimensional regression. Mean-field VB for high-dimensional sparse regression, as in Ray and Szabo (2021), optimizes tractable posteriors under spike-and-slab priors, achieving faster inference than MCMC while preserving selection accuracy.

Practical Applications

Bayesian linear regression finds extensive use in machine learning, particularly for incorporating uncertainty quantification into predictive models. This aids in scenarios like active learning, where the model selects data points to query based on predictive variance, and in sequential decision-making (e.g., bandit problems) for balancing exploration and exploitation through epistemic uncertainty estimates. For instance, in active-learning tasks, the predictive variance allows for efficient sample selection by prioritizing regions of high uncertainty, improving model performance with fewer labeled examples. A notable connection arises in the limiting case where Bayesian linear regression employs a Gaussian prior on the regression weights over a set of basis functions, effectively recovering Gaussian process regression, which extends the framework to non-linear mappings via kernel functions and provides uncertainty estimates for complex data.

In scientific domains, Bayesian linear regression supports robust inference under uncertainty. In epidemiology, it enables adjustment for confounders through informative priors on regression coefficients, facilitating inference in observational studies with sparse or noisy data, such as modeling disease incidence while accounting for socioeconomic variables. In finance, the approach is applied to risk modeling by placing priors on volatilities and correlations, allowing for probabilistic forecasts of asset returns and value-at-risk estimates that incorporate historical market beliefs to mitigate overfitting in volatile environments. Environmental science leverages Bayesian linear regression in spatial regression models to predict phenomena like pollutant levels, where spatial priors capture autocorrelation and enable interpolation across unsampled locations, enhancing accuracy in geospatial datasets.

Practical case studies illustrate its versatility. In A/B testing, Bayesian linear regression with conjugate priors stabilizes estimates for small sample sizes by shrinking coefficients toward prior means, enabling reliable comparison of treatment effects in online experiments with limited traffic, such as website variant evaluations where traditional methods suffer from high variance. For drug dose-response modeling, it fits hierarchical linear models to quantify efficacy curves across patient subgroups, using priors informed by preclinical data to predict optimal dosages while quantifying uncertainty in therapeutic windows, as demonstrated in meta-analyses of clinical trial outcomes.

Implementation is facilitated by accessible software packages. In R, the brms package provides a user-friendly interface for fitting Bayesian linear regression models via Stan. For example, a Bayesian linear regression with normal priors on the intercept and slope can be specified as follows:

r

library(brms)

# Gaussian linear regression with weakly informative normal priors
fit <- brm(
  y ~ x,
  data = dataset,
  family = gaussian(),
  prior = c(
    prior(normal(0, 10), class = Intercept),
    prior(normal(0, 10), class = b)
  ),
  control = list(adapt_delta = 0.95)
)

summary(fit)


This code fits the model via MCMC and summarizes the posterior. In Python, PyMC offers similar functionality, with optional NumPyro and JAX backends for efficient sampling. A simple example with a normal likelihood, normal priors on the intercept and slope, and a half-normal prior on the noise scale is:

python

import numpy as np
import pymc as pm

with pm.Model() as model:
    # Priors for the intercept, slope, and noise scale
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    beta = pm.Normal('beta', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=1)

    # Linear predictor and Gaussian likelihood
    mu = alpha + beta * x
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)

    # Draw posterior samples
    trace = pm.sample(1000)

pm.summary(trace)


These tools handle posterior inference automatically, often outperforming frequentist alternatives in low-data settings. The practical benefits of Bayesian linear regression include superior handling of small datasets through prior regularization, which prevents overfitting, and seamless incorporation of expert knowledge via informative priors, leading to more interpretable and robust models. Empirical comparisons in low-data regimes demonstrate reduced mean squared error (MSE) compared to ordinary least squares, particularly when priors align with domain expectations, while maintaining comparable performance on larger datasets.
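
The following sketch outlines how such a low-data comparison could be run on synthetic data; the dimensions, prior variance, and noise level are arbitrary choices, and no particular numerical outcome is implied.

python

import numpy as np

rng = np.random.default_rng(42)
p, n, reps, tau2 = 5, 10, 500, 1.0     # small n relative to p; tau2 is an assumed prior variance
beta_true = rng.normal(size=p)

mse_ols, mse_bayes = [], []
for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = X @ beta_true + rng.normal(size=n)
    b_ols = np.linalg.pinv(X) @ y
    # Posterior mean under a N(0, tau2 I) prior on beta with the noise variance fixed at 1
    # (a ridge-type shrinkage estimator)
    b_bayes = np.linalg.solve(X.T @ X + np.eye(p) / tau2, X.T @ y)
    mse_ols.append(np.mean((b_ols - beta_true) ** 2))
    mse_bayes.append(np.mean((b_bayes - beta_true) ** 2))

print("mean MSE (OLS):", np.mean(mse_ols), " mean MSE (Bayes):", np.mean(mse_bayes))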
