Bayesian linear regression
Bayesian linear regression is a type of conditional modeling in which the mean of one variable is described by a linear combination of other variables, with the goal of obtaining the posterior probability of the regression coefficients (as well as other parameters describing the distribution of the regressand) and ultimately allowing the out-of-sample prediction of the regressand (often labelled $y$) conditional on observed values of the regressors (usually $X$). The simplest and most widely used version of this model is the normal linear model, in which $y$ given $X$ is distributed Gaussian. In this model, and under a particular choice of prior probabilities for the parameters—so-called conjugate priors—the posterior can be found analytically. With more arbitrarily chosen priors, the posteriors generally have to be approximated.
Model setup
Consider a standard linear regression problem, in which for $i = 1, \ldots, n$ we specify the mean of the conditional distribution of $y_i$ given a $k \times 1$ predictor vector $\mathbf{x}_i$:
$$y_i = \mathbf{x}_i^{\mathsf{T}} \boldsymbol\beta + \varepsilon_i,$$
where $\boldsymbol\beta$ is a $k \times 1$ vector, and the $\varepsilon_i$ are independent and identically normally distributed random variables:
$$\varepsilon_i \sim N(0, \sigma^2).$$
This corresponds to the following likelihood function:
$$\rho(\mathbf{y} \mid \mathbf{X}, \boldsymbol\beta, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol\beta)^{\mathsf{T}} (\mathbf{y} - \mathbf{X}\boldsymbol\beta)\right).$$
The ordinary least squares solution is used to estimate the coefficient vector using the Moore–Penrose pseudoinverse:
$$\hat{\boldsymbol\beta} = (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y},$$
where $\mathbf{X}$ is the $n \times k$ design matrix, each row of which is a predictor vector $\mathbf{x}_i^{\mathsf{T}}$; and $\mathbf{y}$ is the column $n$-vector $[y_1 \; \cdots \; y_n]^{\mathsf{T}}$.
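For concreteness, the least squares step can be sketched numerically. The following minimal Python example simulates a small dataset (the sizes and coefficient values are illustrative, not from the text) and recovers $\hat{\boldsymbol\beta}$ with the Moore–Penrose pseudoinverse:

import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3                                   # illustrative sample size and number of coefficients
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # design matrix with an intercept column
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)                # y_i = x_i^T beta + eps_i

beta_hat = np.linalg.pinv(X) @ y               # OLS estimate via the Moore-Penrose pseudoinverse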
This is a frequentist approach, and it assumes that there are enough measurements to say something meaningful about $\boldsymbol\beta$. In the Bayesian approach,[1] the data are supplemented with additional information in the form of a prior probability distribution. The prior belief about the parameters is combined with the data's likelihood function according to Bayes' theorem to yield the posterior belief about the parameters $\boldsymbol\beta$ and $\sigma$. The prior can take different functional forms depending on the domain and the information that is available a priori.
Since the data comprise both $\mathbf{y}$ and $\mathbf{X}$, the focus only on the distribution of $\mathbf{y}$ conditional on $\mathbf{X}$ needs justification. In fact, a "full" Bayesian analysis would require a joint likelihood $\rho(\mathbf{y}, \mathbf{X} \mid \boldsymbol\beta, \sigma^2, \gamma)$ along with a prior $\rho(\boldsymbol\beta, \sigma^2, \gamma)$, where $\gamma$ symbolizes the parameters of the distribution for $\mathbf{X}$. Only under the assumption of (weak) exogeneity can the joint likelihood be factored into $\rho(\mathbf{y} \mid \mathbf{X}, \boldsymbol\beta, \sigma^2)\, \rho(\mathbf{X} \mid \gamma)$.[2] The latter part is usually ignored under the assumption of disjoint parameter sets. More so, under classic assumptions $\mathbf{X}$ is considered chosen (for example, in a designed experiment) and therefore has a known probability without parameters.[3]
With conjugate priors
Conjugate prior distribution
For an arbitrary prior distribution, there may be no analytical solution for the posterior distribution. In this section, we will consider a so-called conjugate prior for which the posterior distribution can be derived analytically.
A prior $\rho(\boldsymbol\beta, \sigma^2)$ is conjugate to this likelihood function if it has the same functional form with respect to $\boldsymbol\beta$ and $\sigma^2$. Since the log-likelihood is quadratic in $\boldsymbol\beta$, the log-likelihood is re-written such that the likelihood becomes normal in $(\boldsymbol\beta - \hat{\boldsymbol\beta})$. Write
$$(\mathbf{y} - \mathbf{X}\boldsymbol\beta)^{\mathsf{T}}(\mathbf{y} - \mathbf{X}\boldsymbol\beta) = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta})^{\mathsf{T}}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}) + (\boldsymbol\beta - \hat{\boldsymbol\beta})^{\mathsf{T}} \mathbf{X}^{\mathsf{T}}\mathbf{X} (\boldsymbol\beta - \hat{\boldsymbol\beta}).$$
The likelihood is now re-written as
$$\rho(\mathbf{y} \mid \mathbf{X}, \boldsymbol\beta, \sigma^2) \propto (\sigma^2)^{-v/2} \exp\left(-\frac{v s^2}{2\sigma^2}\right) (\sigma^2)^{-(n - v)/2} \exp\left(-\frac{1}{2\sigma^2} (\boldsymbol\beta - \hat{\boldsymbol\beta})^{\mathsf{T}} \mathbf{X}^{\mathsf{T}}\mathbf{X} (\boldsymbol\beta - \hat{\boldsymbol\beta})\right),$$
where $v s^2 = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta})^{\mathsf{T}}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta})$ and $v = n - k$, where $k$ is the number of regression coefficients.
This suggests a form for the prior:
$$\rho(\boldsymbol\beta, \sigma^2) = \rho(\sigma^2)\, \rho(\boldsymbol\beta \mid \sigma^2),$$
where $\rho(\sigma^2)$ is an inverse-gamma distribution
$$\rho(\sigma^2) \propto (\sigma^2)^{-\frac{v_0}{2} - 1} \exp\left(-\frac{v_0 s_0^2}{2\sigma^2}\right).$$
In the notation introduced in the inverse-gamma distribution article, this is the density of an $\text{Inv-Gamma}(a_0, b_0)$ distribution with $a_0 = \tfrac{v_0}{2}$ and $b_0 = \tfrac{1}{2} v_0 s_0^2$, with $v_0$ and $s_0^2$ as the prior values of $v$ and $s^2$, respectively. Equivalently, it can also be described as a scaled inverse chi-squared distribution, $\text{Scale-inv-}\chi^2(v_0, s_0^2)$.
Further the conditional prior density $\rho(\boldsymbol\beta \mid \sigma^2)$ is a normal distribution,
$$\rho(\boldsymbol\beta \mid \sigma^2) \propto (\sigma^2)^{-k/2} \exp\left(-\frac{1}{2\sigma^2} (\boldsymbol\beta - \boldsymbol\mu_0)^{\mathsf{T}} \boldsymbol\Lambda_0 (\boldsymbol\beta - \boldsymbol\mu_0)\right).$$
In the notation of the normal distribution, the conditional prior distribution is $\mathcal{N}(\boldsymbol\mu_0, \sigma^2 \boldsymbol\Lambda_0^{-1})$.
Posterior distribution
With the prior now specified, the posterior distribution can be expressed as
$$\rho(\boldsymbol\beta, \sigma^2 \mid \mathbf{y}, \mathbf{X}) \propto \rho(\mathbf{y} \mid \mathbf{X}, \boldsymbol\beta, \sigma^2)\, \rho(\boldsymbol\beta \mid \sigma^2)\, \rho(\sigma^2).$$
With some re-arrangement,[4] the posterior can be re-written so that the posterior mean $\boldsymbol\mu_n$ of the parameter vector $\boldsymbol\beta$ can be expressed in terms of the least squares estimator $\hat{\boldsymbol\beta}$ and the prior mean $\boldsymbol\mu_0$, with the strength of the prior indicated by the prior precision matrix $\boldsymbol\Lambda_0$:
$$\boldsymbol\mu_n = (\mathbf{X}^{\mathsf{T}}\mathbf{X} + \boldsymbol\Lambda_0)^{-1} (\mathbf{X}^{\mathsf{T}}\mathbf{X}\hat{\boldsymbol\beta} + \boldsymbol\Lambda_0 \boldsymbol\mu_0).$$
To justify that $\boldsymbol\mu_n$ is indeed the posterior mean, the quadratic terms in the exponential can be re-arranged as a quadratic form in $\boldsymbol\beta - \boldsymbol\mu_n$:[5]
$$(\mathbf{y} - \mathbf{X}\boldsymbol\beta)^{\mathsf{T}}(\mathbf{y} - \mathbf{X}\boldsymbol\beta) + (\boldsymbol\beta - \boldsymbol\mu_0)^{\mathsf{T}}\boldsymbol\Lambda_0(\boldsymbol\beta - \boldsymbol\mu_0) = (\boldsymbol\beta - \boldsymbol\mu_n)^{\mathsf{T}}(\mathbf{X}^{\mathsf{T}}\mathbf{X} + \boldsymbol\Lambda_0)(\boldsymbol\beta - \boldsymbol\mu_n) + \mathbf{y}^{\mathsf{T}}\mathbf{y} - \boldsymbol\mu_n^{\mathsf{T}}(\mathbf{X}^{\mathsf{T}}\mathbf{X} + \boldsymbol\Lambda_0)\boldsymbol\mu_n + \boldsymbol\mu_0^{\mathsf{T}}\boldsymbol\Lambda_0\boldsymbol\mu_0.$$
Now the posterior can be expressed as a normal distribution times an inverse-gamma distribution:
$$\rho(\boldsymbol\beta, \sigma^2 \mid \mathbf{y}, \mathbf{X}) \propto (\sigma^2)^{-k/2} \exp\left(-\frac{1}{2\sigma^2}(\boldsymbol\beta - \boldsymbol\mu_n)^{\mathsf{T}}(\mathbf{X}^{\mathsf{T}}\mathbf{X} + \boldsymbol\Lambda_0)(\boldsymbol\beta - \boldsymbol\mu_n)\right)\, (\sigma^2)^{-(a_n + 1)} \exp\left(-\frac{b_n}{\sigma^2}\right),$$
with $a_n$ and $b_n$ defined below.
Therefore, the posterior distribution can be parametrized as follows:
$$\rho(\boldsymbol\beta, \sigma^2 \mid \mathbf{y}, \mathbf{X}) \propto \rho(\boldsymbol\beta \mid \sigma^2, \mathbf{y}, \mathbf{X})\, \rho(\sigma^2 \mid \mathbf{y}, \mathbf{X}),$$
where the two factors correspond to the densities of $\mathcal{N}(\boldsymbol\mu_n, \sigma^2 \boldsymbol\Lambda_n^{-1})$ and $\text{Inv-Gamma}(a_n, b_n)$ distributions, with the parameters of these given by
$$\boldsymbol\Lambda_n = \mathbf{X}^{\mathsf{T}}\mathbf{X} + \boldsymbol\Lambda_0, \qquad \boldsymbol\mu_n = \boldsymbol\Lambda_n^{-1}(\boldsymbol\Lambda_0\boldsymbol\mu_0 + \mathbf{X}^{\mathsf{T}}\mathbf{X}\hat{\boldsymbol\beta}),$$
$$a_n = a_0 + \frac{n}{2}, \qquad b_n = b_0 + \frac{1}{2}\left(\mathbf{y}^{\mathsf{T}}\mathbf{y} + \boldsymbol\mu_0^{\mathsf{T}}\boldsymbol\Lambda_0\boldsymbol\mu_0 - \boldsymbol\mu_n^{\mathsf{T}}\boldsymbol\Lambda_n\boldsymbol\mu_n\right),$$
which illustrates Bayesian inference being a compromise between the information contained in the prior and the information contained in the sample.
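A compact NumPy sketch of this conjugate update, assuming the prior hyperparameters $\boldsymbol\mu_0$, $\boldsymbol\Lambda_0$, $a_0$, $b_0$ are supplied as arrays and scalars, might look as follows; the identity $\mathbf{X}^{\mathsf{T}}\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}^{\mathsf{T}}\mathbf{y}$ is used so the least squares estimate never has to be formed explicitly:

import numpy as np

def nig_posterior(X, y, mu0, Lambda0, a0, b0):
    """Normal-inverse-gamma posterior update for the conjugate linear model."""
    n = len(y)
    XtX = X.T @ X
    Lambda_n = Lambda0 + XtX                                   # posterior precision (up to sigma^2)
    mu_n = np.linalg.solve(Lambda_n, Lambda0 @ mu0 + X.T @ y)  # posterior mean of beta
    a_n = a0 + n / 2
    b_n = b0 + 0.5 * (y @ y + mu0 @ Lambda0 @ mu0 - mu_n @ Lambda_n @ mu_n)
    return mu_n, Lambda_n, a_n, b_n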
Model evidence
The model evidence $p(\mathbf{y} \mid m)$ is the probability of the data given the model $m$. It is also known as the marginal likelihood, and as the prior predictive density. Here, the model is defined by the likelihood function $p(\mathbf{y} \mid \mathbf{X}, \boldsymbol\beta, \sigma^2)$ and the prior distribution on the parameters, i.e. $p(\boldsymbol\beta, \sigma^2)$. The model evidence captures in a single number how well such a model explains the observations. The model evidence of the Bayesian linear regression model presented in this section can be used to compare competing linear models by Bayes factors. These models may differ in the number and values of the predictor variables as well as in their priors on the model parameters. Model complexity is already taken into account by the model evidence, because it marginalizes out the parameters by integrating over all possible values of $\boldsymbol\beta$ and $\sigma^2$. This integral can be computed analytically and the solution is given in the following equation:[6]
$$p(\mathbf{y} \mid m) = \frac{1}{(2\pi)^{n/2}} \sqrt{\frac{\det \boldsymbol\Lambda_0}{\det \boldsymbol\Lambda_n}} \cdot \frac{b_0^{a_0}}{b_n^{a_n}} \cdot \frac{\Gamma(a_n)}{\Gamma(a_0)}.$$
Here $\Gamma$ denotes the gamma function. Because we have chosen a conjugate prior, the marginal likelihood can also be easily computed by evaluating the following equality for arbitrary values of $\boldsymbol\beta$ and $\sigma^2$:[7]
$$p(\mathbf{y} \mid m) = \frac{p(\boldsymbol\beta, \sigma^2 \mid m)\, p(\mathbf{y} \mid \mathbf{X}, \boldsymbol\beta, \sigma^2, m)}{p(\boldsymbol\beta, \sigma^2 \mid \mathbf{y}, \mathbf{X}, m)}.$$
Note that this equation follows from a re-arrangement of Bayes' theorem. Inserting the formulas for the prior, the likelihood, and the posterior and simplifying the resulting expression leads to the analytic expression given above.
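In practice the closed-form evidence is best evaluated on the log scale for numerical stability. A minimal sketch, reusing the nig_posterior helper from the earlier example, is:

import numpy as np
from scipy.special import gammaln

def log_evidence(X, y, mu0, Lambda0, a0, b0):
    """Log marginal likelihood of the conjugate normal-inverse-gamma linear model."""
    n = len(y)
    _, Lambda_n, a_n, b_n = nig_posterior(X, y, mu0, Lambda0, a0, b0)
    _, logdet0 = np.linalg.slogdet(Lambda0)
    _, logdet_n = np.linalg.slogdet(Lambda_n)
    return (-0.5 * n * np.log(2 * np.pi)
            + 0.5 * (logdet0 - logdet_n)
            + a0 * np.log(b0) - a_n * np.log(b_n)
            + gammaln(a_n) - gammaln(a0))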
Other cases
In general, it may be impossible or impractical to derive the posterior distribution analytically. However, it is possible to approximate the posterior by an approximate Bayesian inference method such as Monte Carlo sampling,[8] INLA or variational Bayes.
The special case $\boldsymbol\mu_0 = 0,\ \boldsymbol\Lambda_0 = c\mathbf{I}$ is called ridge regression.
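To see the correspondence numerically, the posterior mean under $\boldsymbol\mu_0 = 0$ and $\boldsymbol\Lambda_0 = c\mathbf{I}$ can be compared with the classical ridge estimator. A short sketch, reusing the nig_posterior helper and the simulated X and y from the earlier examples (the penalty value is illustrative):

import numpy as np

c = 2.0                                            # ridge penalty / prior precision scale
k = X.shape[1]
mu_n, _, _, _ = nig_posterior(X, y, mu0=np.zeros(k), Lambda0=c * np.eye(k), a0=1.0, b0=1.0)  # a0, b0 do not affect the mean
ridge = np.linalg.solve(X.T @ X + c * np.eye(k), X.T @ y)   # classical ridge solution
assert np.allclose(mu_n, ridge)                    # the posterior mean equals the ridge estimate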
A similar analysis can be performed for the general case of the multivariate regression and part of this provides for Bayesian estimation of covariance matrices: see Bayesian multivariate linear regression.
See also
- Bayes linear statistics
- Constrained least squares
- Regularized least squares
- Tikhonov regularization
- Spike and slab variable selection
- Bayesian interpretation of kernel regularization
Notes
- ^ Huang, Yunfei; Gompper, Gerhard; Sabass, Benedikt (2020). "A Bayesian traction force microscopy method with automated denoising in a user-friendly software package". Computer Physics Communications. 256: 107313. arXiv:2005.01377. Bibcode:2020CoPhC.25607313H. doi:10.1016/j.cpc.2020.107313.
- ^ See Jackman (2009), p. 101.
- ^ See Gelman et al. (2013), p. 354.
- ^ The intermediate steps of this computation can be found in O'Hagan (1994) at the beginning of the chapter on Linear models.
- ^ The intermediate steps are in Fahrmeir et al. (2009) on page 188.
- ^ The intermediate steps of this computation can be found in O'Hagan (1994) on page 257.
- ^ Chib, Siddhartha (1995). "Marginal Likelihood from the Gibbs Output". Journal of the American Statistical Association. 90 (432): 1313–1321. doi:10.2307/2291521.
- ^ Carlin and Louis (2008) and Gelman, et al. (2003) explain how to use sampling methods for Bayesian linear regression.
References
- Box, G. E. P.; Tiao, G. C. (1973). Bayesian Inference in Statistical Analysis. Wiley. ISBN 0-471-57428-7.
- Carlin, Bradley P.; Louis, Thomas A. (2008). Bayesian Methods for Data Analysis (Third ed.). Boca Raton, FL: Chapman and Hall/CRC. ISBN 978-1-58488-697-6.
- Fahrmeir, L.; Kneib, T.; Lang, S. (2009). Regression. Modelle, Methoden und Anwendungen (Second ed.). Heidelberg: Springer. doi:10.1007/978-3-642-01837-4. ISBN 978-3-642-01836-7.
- Gelman, Andrew; et al. (2013). "Introduction to regression models". Bayesian Data Analysis (Third ed.). Boca Raton, FL: Chapman and Hall/CRC. pp. 353–380. ISBN 978-1-4398-4095-5.
- Jackman, Simon (2009). "Regression models". Bayesian Analysis for the Social Sciences. Wiley. pp. 99–124. ISBN 978-0-470-01154-6.
- Rossi, Peter E.; Allenby, Greg M.; McCulloch, Robert (2006). Bayesian Statistics and Marketing. John Wiley & Sons. ISBN 0-470-86367-6.
- O'Hagan, Anthony (1994). Bayesian Inference. Kendall's Advanced Theory of Statistics. Vol. 2B (First ed.). Halsted. ISBN 0-340-52922-9.
External links
- Bayesian estimation of linear models (R programming wikibook). Bayesian linear regression as implemented in R.
Bayesian linear regression
Introduction and Fundamentals
Overview
Bayesian linear regression is a probabilistic approach to modeling the linear relationship between a response variable and one or more predictor variables, where the regression coefficients and noise variance are treated as random variables assigned prior probability distributions. The posterior distribution for these parameters is derived by updating the prior with the data likelihood using Bayes' theorem, yielding a full probability distribution that encapsulates uncertainty in the model. This framework extends classical linear regression by incorporating prior knowledge and providing a coherent mechanism for inference under uncertainty.[2]

The origins of Bayesian linear regression can be traced to Carl Friedrich Gauss's 1809 work on least squares estimation, which he justified using principles of inverse probability akin to Bayesian reasoning, laying early groundwork for treating parameters probabilistically. In the 20th century, the approach gained prominence through contributions such as those of Dennis V. Lindley and Adrian F. M. Smith, who in 1972 developed Bayes estimates for the linear model, emphasizing exchangeability and prior elicitation to enhance statistical inference. These developments built on foundational ideas from Thomas Bayes and Pierre-Simon Laplace, adapting them to regression contexts for more robust handling of data scarcity and expert knowledge.[5][6]

Key motivations for adopting Bayesian linear regression over frequentist methods stem from its capacity to address limitations of ordinary least squares (OLS), which yields single point estimates without priors or inherent uncertainty measures. Priors enable regularization to prevent overfitting, particularly in high-dimensional settings, while the posterior supports credible intervals for uncertainty quantification and the marginal likelihood facilitates objective model selection and comparison. This results in more flexible and interpretable analyses, especially when prior information from domain expertise is available.[1]
Model Formulation
Bayesian linear regression begins with the standard linear model, which posits a linear relationship between the response variable and predictors. Specifically, for $n$ observations, the model is given by
$$\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon,$$
where $\mathbf{y}$ is the $n \times 1$ vector of responses, $\mathbf{X}$ is the $n \times p$ design matrix of predictors (including a column of ones for the intercept), $\boldsymbol\beta$ is the $p \times 1$ vector of regression coefficients, and $\boldsymbol\varepsilon$ is the $n \times 1$ vector of errors with $\boldsymbol\varepsilon \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)$.[7] This formulation assumes that the errors are independent and identically distributed as normal with mean zero and constant variance $\sigma^2$.[7]

The likelihood function for the observed data given the parameters $\boldsymbol\beta$ and $\sigma^2$ follows directly from the normality assumption and is expressed as
$$p(\mathbf{y} \mid \mathbf{X}, \boldsymbol\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol\beta)^{\mathsf{T}} (\mathbf{y} - \mathbf{X}\boldsymbol\beta)\right).$$
This multivariate normal likelihood encapsulates the contribution of the data to parameter inference in the Bayesian framework.[7]

The model relies on several key assumptions: linearity in the parameters, independence of the errors, homoscedasticity (constant variance $\sigma^2$), and normality of the errors.[8] Violations of these assumptions, such as heteroscedasticity or non-normality, can affect inference; however, Bayesian approaches often demonstrate robustness to such departures through appropriate prior specifications or alternative error distributions such as the t-distribution.[7]

For comparison, the frequentist ordinary least squares (OLS) estimator provides a point estimate for $\boldsymbol\beta$ by minimizing the residual sum of squares, yielding $\hat{\boldsymbol\beta}_{\text{OLS}} = (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y}$, with asymptotic variance $\sigma^2 (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}$.[9] This estimator coincides with the posterior mode under noninformative priors in the Bayesian setting but lacks the full uncertainty quantification provided by the posterior distribution.[7]
Bayesian Inference Principles
Bayesian inference in linear regression applies Bayes' theorem to update beliefs about the model parameters based on observed data, yielding a posterior distribution that incorporates both prior knowledge and the evidence from the data. For the regression coefficients $\boldsymbol\beta$ and error variance $\sigma^2$, the theorem states that the joint posterior is proportional to the product of the likelihood and the prior:
$$p(\boldsymbol\beta, \sigma^2 \mid \mathbf{y}, \mathbf{X}) \propto p(\mathbf{y} \mid \mathbf{X}, \boldsymbol\beta, \sigma^2)\, p(\boldsymbol\beta, \sigma^2),$$
where $\mathbf{y}$ denotes the response vector, $\mathbf{X}$ the design matrix, $p(\mathbf{y} \mid \mathbf{X}, \boldsymbol\beta, \sigma^2)$ the Gaussian likelihood assuming normally distributed errors, and $p(\boldsymbol\beta, \sigma^2)$ the prior on the parameters. This formulation allows the posterior to reflect the full uncertainty in $\boldsymbol\beta$ and $\sigma^2$, rather than relying on point estimates.[10]

The joint prior is typically specified in a separable manner as $p(\boldsymbol\beta, \sigma^2) = p(\boldsymbol\beta \mid \sigma^2)\, p(\sigma^2)$, where the conditional prior on $\boldsymbol\beta$ given $\sigma^2$ often follows a multivariate normal distribution scaled by the variance, and $p(\sigma^2)$ is a marginal prior such as an inverse-gamma or scaled inverse chi-squared distribution. This structure facilitates analytical tractability in conjugate cases and enables the incorporation of domain-specific information, such as regularization through informative priors on $\boldsymbol\beta$.[10]

A key output of Bayesian inference is the posterior predictive distribution, which generates probabilistic forecasts for new data at covariate values $\tilde{\mathbf{x}}$ by marginalizing over the posterior:
$$p(\tilde{y} \mid \tilde{\mathbf{x}}, \mathbf{y}, \mathbf{X}) = \int p(\tilde{y} \mid \tilde{\mathbf{x}}, \boldsymbol\beta, \sigma^2)\, p(\boldsymbol\beta, \sigma^2 \mid \mathbf{y}, \mathbf{X})\, d\boldsymbol\beta\, d\sigma^2.$$
This integral quantifies prediction uncertainty directly from the model. The advantages of this full distributional approach include the ability to derive credible intervals for parameters and predictions, capturing epistemic and aleatoric uncertainty, as well as shrinkage effects from priors that regularize estimates toward plausible values, which is particularly beneficial in high dimensions or small samples. In contrast, the maximum a posteriori (MAP) estimate optimizes the posterior for a single point, forgoing the richer uncertainty quantification provided by the full distribution.[10]
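Given any set of posterior draws, the posterior predictive integral above can be approximated by Monte Carlo. A minimal sketch, assuming arrays beta_draws (one row per draw) and sigma2_draws are already available from some posterior sampler:

import numpy as np

def predictive_draws(x_new, beta_draws, sigma2_draws, rng=None):
    """Monte Carlo draws from the posterior predictive at a single new input x_new."""
    if rng is None:
        rng = np.random.default_rng()
    mean = beta_draws @ x_new                          # x_new^T beta for every posterior draw
    return rng.normal(mean, np.sqrt(sigma2_draws))     # add observation noise draw by draw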
Prior Distributions
Conjugate Priors
In Bayesian linear regression, the conjugate prior for the model parameters $\boldsymbol\beta$ and the noise variance $\sigma^2$ is the normal-inverse-gamma (NIG) distribution, which belongs to the exponential family and matches the form of the normal likelihood to enable analytical posterior updates.[11] The marginal prior on $\sigma^2$ is specified as an inverse-gamma distribution,
$$\sigma^2 \sim \text{Inv-Gamma}(a_0, b_0),$$
where $a_0$ and $b_0$ are hyperparameters controlling the prior shape and scale, respectively.[11] Conditional on $\sigma^2$, the prior on the regression coefficients is multivariate normal,
$$\boldsymbol\beta \mid \sigma^2 \sim \mathcal{N}\left(\boldsymbol\mu_0,\ \sigma^2 \boldsymbol\Lambda_0^{-1}\right),$$
with hyperparameters $\boldsymbol\mu_0$ (a $p \times 1$ vector, where $p$ is the number of predictors), a prior precision weight $n_0$, and a prior design matrix $\mathbf{X}_0$ that together determine the prior precision matrix $\boldsymbol\Lambda_0$.[11]

The hyperparameters admit natural interpretations in terms of fictitious prior data: $a_0$ represents the prior degrees of freedom (analogous to the effective number of prior observations for the variance estimate), $b_0$ is the prior scale or expected variance based on that data, $\boldsymbol\mu_0$ is the prior mean estimate for $\boldsymbol\beta$, $n_0$ acts as the prior sample size or precision weight, and $\mathbf{X}_0$ serves as the design matrix for the imaginary prior dataset that generates the prior covariance structure.[11] This setup allows the prior to encode beliefs derived from previous experiments or domain knowledge, with larger $n_0$ and $a_0$ implying stronger prior influence. The joint prior density is the product of the conditional normal and marginal inverse-gamma densities:
$$p(\boldsymbol\beta, \sigma^2) = p(\boldsymbol\beta \mid \sigma^2)\, p(\sigma^2).$$
This form ensures conjugacy with the Gaussian likelihood, yielding a posterior of the same NIG family.[11]

Special cases of the NIG prior include non-informative limits, such as the Jeffreys prior $p(\boldsymbol\beta, \sigma^2) \propto 1/\sigma^2$, obtained by letting $a_0, b_0 \to 0$ and the prior precision on $\boldsymbol\beta$ tend to zero (equivalently, letting its prior variance diverge), which expresses vague prior knowledge without favoring specific values.[11] Additionally, fixing $\sigma^2$ and using a normal prior on $\boldsymbol\beta$ with covariance $\tau^2 \mathbf{I}$ (a special case where $\boldsymbol\mu_0 = \mathbf{0}$ and $\boldsymbol\Lambda_0$ is proportional to the identity) yields a posterior mean equivalent to the ridge regression estimator with penalty parameter $\lambda = \sigma^2 / \tau^2$, providing a Bayesian interpretation of regularization.[12]
Non-Conjugate Priors
Non-conjugate priors in Bayesian linear regression are prior distributions on model parameters that do not yield analytically tractable posterior distributions when combined with the likelihood, necessitating computational approximation techniques for inference.[13] Unlike conjugate priors, which facilitate exact solutions, non-conjugate priors offer greater flexibility for incorporating complex structures such as sparsity or domain-specific knowledge, making them suitable for modern high-dimensional problems.[14]

Prominent examples include sparsity-inducing priors such as the horseshoe prior, which places a conditional normal distribution on each regression coefficient, $\beta_j \mid \lambda_j, \tau \sim \mathcal{N}(0, \lambda_j^2 \tau^2)$, with a global shrinkage parameter $\tau$ and local shrinkage parameters $\lambda_j$ given half-Cauchy priors, promoting many coefficients to near zero while allowing a few to remain large.[15] The Laplace prior, equivalent to a double-exponential distribution, induces L1 regularization akin to the lasso and is implemented hierarchically as $\beta_j \mid \tau_j^2, \sigma^2 \sim \mathcal{N}(0, \sigma^2 \tau_j^2)$ with $\tau_j^2 \sim \text{Exponential}(\lambda^2/2)$, enabling automatic variable selection through shrinkage.[16] Empirical Bayes approaches, which estimate hyperparameters from the data itself (e.g., by maximizing the marginal likelihood), can be applied to various priors, including conjugate ones like Zellner's g-prior with an empirically chosen scaling factor g, to balance shrinkage in high dimensions.[17]

These priors excel in high-dimensional settings where the number of predictors exceeds the number of observations, facilitating effective variable selection and improved prediction by concentrating posterior mass on sparse models.[14] They also allow the incorporation of domain knowledge through hierarchical structures, such as multilevel priors that model group-level variability in coefficients, enhancing interpretability in contexts like genomics for gene selection or finance for factor modeling under sparsity constraints.[18] Non-conjugate priors are preferred over conjugate ones in such scenarios, trading analytical simplicity for robustness to irrelevant variables and better out-of-sample performance.[13] A key challenge is the resulting posterior intractability, which demands approximation via methods like Markov chain Monte Carlo, increasing computational cost compared to conjugate cases.[19] Hyperparameters in these priors, such as the global shrinkage $\tau$ in the horseshoe or the rate $\lambda$ in the Laplace prior, are often chosen by empirical Bayes approaches that maximize the marginal likelihood, or via cross-validation to optimize predictive performance.[20]
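A horseshoe-type regression of this kind can be written directly in a probabilistic programming language. The following PyMC sketch assumes NumPy arrays X (n by p) and y are prepared elsewhere; the hyperparameter choices are illustrative rather than prescribed:

import pymc as pm

with pm.Model() as horseshoe_model:
    tau = pm.HalfCauchy("tau", beta=1.0)                        # global shrinkage
    lam = pm.HalfCauchy("lam", beta=1.0, shape=X.shape[1])      # local shrinkage per coefficient
    beta = pm.Normal("beta", mu=0.0, sigma=tau * lam, shape=X.shape[1])
    sigma = pm.HalfNormal("sigma", sigma=1.0)                   # noise scale
    mu = pm.math.dot(X, beta)
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000, target_accept=0.95)      # NUTS handles the non-conjugacy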
Posterior Inference
Analytical Posterior Derivation
In Bayesian linear regression, the analytical posterior distribution can be derived exactly when using the conjugate normal-inverse-gamma (NIG) prior for the model parameters $\boldsymbol\beta$ and $\sigma^2$. The model assumes $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon$, where $\boldsymbol\varepsilon \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$. The NIG prior is specified as $\boldsymbol\beta \mid \sigma^2 \sim \mathcal{N}(\boldsymbol\mu_0, \sigma^2 \boldsymbol\Lambda_0^{-1})$ and $\sigma^2 \sim \text{Inv-Gamma}(a_0, b_0)$, which encodes prior information as if derived from pseudo-observations with design matrix $\mathbf{X}_0$.[11][21]

The likelihood is the Gaussian $p(\mathbf{y} \mid \mathbf{X}, \boldsymbol\beta, \sigma^2) = \mathcal{N}(\mathbf{X}\boldsymbol\beta, \sigma^2 \mathbf{I})$. The joint posterior is proportional to the product of the prior and likelihood. Due to conjugacy, this posterior remains in the NIG family, with updated hyperparameters derived by completing the square in the exponent of the normal components and aggregating scale terms for the inverse-gamma part. The conditional posterior for the coefficients is $\boldsymbol\beta \mid \sigma^2, \mathbf{y}, \mathbf{X} \sim \mathcal{N}(\boldsymbol\mu_n, \sigma^2 \boldsymbol\Lambda_n^{-1})$, and the marginal posterior for the variance is $\sigma^2 \mid \mathbf{y}, \mathbf{X} \sim \text{Inv-Gamma}(a_n, b_n)$.[11][22][21]

The posterior hyperparameters are updated as follows:
$$\boldsymbol\Lambda_n = \boldsymbol\Lambda_0 + \mathbf{X}^{\mathsf{T}}\mathbf{X}, \qquad \boldsymbol\mu_n = \boldsymbol\Lambda_n^{-1}\left(\boldsymbol\Lambda_0 \boldsymbol\mu_0 + \mathbf{X}^{\mathsf{T}}\mathbf{y}\right),$$
$$a_n = a_0 + \frac{n}{2}, \qquad b_n = b_0 + \frac{1}{2}\left(\mathbf{y}^{\mathsf{T}}\mathbf{y} + \boldsymbol\mu_0^{\mathsf{T}}\boldsymbol\Lambda_0\boldsymbol\mu_0 - \boldsymbol\mu_n^{\mathsf{T}}\boldsymbol\Lambda_n\boldsymbol\mu_n\right).$$
These updates reflect the accumulation of prior pseudo-data with the observed data: $a_n$ increases with the sample size $n$, the precision matrix $\boldsymbol\Lambda_n$ augments the prior information matrix with the data information matrix $\mathbf{X}^{\mathsf{T}}\mathbf{X}$, $\boldsymbol\mu_n$ weights the prior mean $\boldsymbol\mu_0$ and the ordinary least squares estimator $\hat{\boldsymbol\beta}$ by their respective precisions, and $b_n$ adjusts the scale by the residual sum of squares relative to the posterior mean plus a prior-data discrepancy term.[11][22][21]

Integrating out $\sigma^2$, the marginal posterior for $\boldsymbol\beta$ is a multivariate t-distribution,
$$\boldsymbol\beta \mid \mathbf{y}, \mathbf{X} \sim t_{2a_n}\left(\boldsymbol\mu_n,\ \frac{b_n}{a_n}\boldsymbol\Lambda_n^{-1}\right).$$
This arises because the conditional normal for $\boldsymbol\beta$ given $\sigma^2$ combined with the inverse-gamma marginal for $\sigma^2$ yields a t-distribution, providing heavier tails than the normal to account for uncertainty in $\sigma^2$.[11][22][21]

Key properties of this posterior include that the mean $\boldsymbol\mu_n$ is a precision-weighted average of the prior mean $\boldsymbol\mu_0$ (weighted by the prior precision $\boldsymbol\Lambda_0$) and the data-based estimator $\hat{\boldsymbol\beta}$ (weighted by $\mathbf{X}^{\mathsf{T}}\mathbf{X}$), ensuring shrinkage toward the prior. The posterior variance shrinks with increasing $n$, as $\boldsymbol\Lambda_n$ grows, concentrating the distribution around $\boldsymbol\mu_n$ and reflecting reduced uncertainty from more observations.[11][22]
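The marginal t-distribution makes coefficient-wise credible intervals straightforward to compute. A small SciPy sketch, with mu_n, Lambda_n, a_n, b_n taken from a conjugate update such as the nig_posterior helper sketched earlier:

import numpy as np
from scipy import stats

scale_matrix = (b_n / a_n) * np.linalg.inv(Lambda_n)     # scale matrix of the marginal t for beta
se = np.sqrt(np.diag(scale_matrix))                      # per-coefficient scale
lower, upper = stats.t.interval(0.95, df=2 * a_n, loc=mu_n, scale=se)   # 95% credible intervals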
Computational Methods
When analytical solutions for the posterior distribution are intractable, such as in cases involving non-conjugate priors or high-dimensional settings, computational methods become essential for approximating Bayesian inference in linear regression models. These techniques enable the generation of samples or approximations from the posterior distribution $p(\boldsymbol\beta, \sigma^2 \mid \mathbf{y}, \mathbf{X})$, facilitating uncertainty quantification and model fitting.[23]

Markov chain Monte Carlo (MCMC) methods are a cornerstone for posterior sampling in Bayesian linear regression, particularly when exact derivations are unavailable. For conditionally conjugate priors, Gibbs sampling offers an efficient approach by iteratively drawing from the full conditional posterior distributions of the parameters, such as alternating between the regression coefficients $\boldsymbol\beta$ given $\sigma^2$ and the variance $\sigma^2$ given $\boldsymbol\beta$.[23] This method leverages the conditional conjugacy to produce a Markov chain that converges to the joint posterior. For more general, non-conjugate cases, the Metropolis-Hastings algorithm is employed, where proposals for parameter updates are accepted or rejected based on the posterior ratio to ensure the chain targets the desired distribution.[24] Implementations of these MCMC techniques are available in probabilistic programming libraries such as Stan, which uses Hamiltonian Monte Carlo for efficient sampling, and PyMC, which provides Metropolis-Hastings and gradient-based samplers with optional NumPyro backends.

Variational inference provides a faster, optimization-based alternative to MCMC by approximating the posterior with a simpler distribution that minimizes the Kullback-Leibler (KL) divergence to the true posterior. A common mean-field approximation assumes a factorized form, such as $q(\boldsymbol\beta, \sigma^2) = q(\boldsymbol\beta)\, q(\sigma^2)$, where each factor is typically Gaussian, and parameters are updated via coordinate ascent variational Bayes to iteratively optimize the evidence lower bound.[25] This approach scales well to large datasets but introduces bias by enforcing independence among parameters.[26]

The Laplace approximation offers a deterministic method for posterior approximation by performing a second-order Taylor expansion of the log-posterior around its mode, the maximum a posteriori (MAP) estimate, yielding a Gaussian distribution centered at the mode with covariance given by the inverse Hessian.[27] This quadratic approximation is particularly useful for moderate-dimensional problems where the posterior is roughly symmetric and unimodal, providing quick estimates of posterior moments without sampling.[27]

Assessing convergence and reliability of these approximations is critical, with diagnostics including trace plots to visualize chain mixing and stationarity, as well as the Gelman-Rubin statistic, which compares variances within and between multiple chains to detect non-convergence (values close to 1 indicate convergence).[28] Computational costs rise with the sample size $n$ and dimension $p$: MCMC may require thousands of iterations for high-dimensional models, while variational methods and Laplace approximations offer better scalability at the expense of accuracy for multimodal posteriors.[28] Modern software tools facilitate these methods, including JAGS for Gibbs sampling via conditional specifications, Stan for advanced MCMC with automatic differentiation, PyMC for flexible Python-based modeling with JAX acceleration as of 2025 updates, and Pyro for scalable inference in deep probabilistic models.
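As an illustration of the Gibbs scheme described above, the following sketch alternates between the two full conditionals for a model with independent normal and inverse-gamma priors (a common conditionally conjugate choice); the prior arguments and iteration count are assumptions of the example rather than recommendations:

import numpy as np

def gibbs_blr(X, y, m0, S0_inv, a0, b0, n_iter=2000, seed=1):
    """Gibbs sampler alternating beta | sigma2 and sigma2 | beta."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    beta, sigma2 = np.zeros(p), 1.0
    beta_draws, sigma2_draws = [], []
    for _ in range(n_iter):
        # beta | sigma2, y ~ N(m, V)
        V = np.linalg.inv(S0_inv + XtX / sigma2)
        m = V @ (S0_inv @ m0 + Xty / sigma2)
        beta = rng.multivariate_normal(m, V)
        # sigma2 | beta, y ~ Inv-Gamma(a0 + n/2, b0 + 0.5 * RSS)
        resid = y - X @ beta
        sigma2 = 1.0 / rng.gamma(a0 + n / 2, 1.0 / (b0 + 0.5 * resid @ resid))
        beta_draws.append(beta)
        sigma2_draws.append(sigma2)
    return np.array(beta_draws), np.array(sigma2_draws)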
Model Evaluation and Predictions
Marginal Likelihood
The marginal likelihood, also known as the model evidence, in Bayesian linear regression is defined as the probability of the observed data $\mathbf{y}$ given the design matrix $\mathbf{X}$ and model $m$, obtained by integrating out the parameters $\boldsymbol\beta$ and $\sigma^2$:
$$p(\mathbf{y} \mid \mathbf{X}, m) = \int p(\mathbf{y} \mid \mathbf{X}, \boldsymbol\beta, \sigma^2)\, p(\boldsymbol\beta, \sigma^2 \mid m)\, d\boldsymbol\beta\, d\sigma^2.$$
This integral represents the normalizing constant of the posterior distribution and quantifies the plausibility of the data under the model prior to observing the parameters.[29]

Under conjugate priors such as the normal-inverse-gamma (NIG) distribution for $(\boldsymbol\beta, \sigma^2)$, the marginal likelihood admits a closed-form expression as the probability density function of a multivariate Student-t distribution for $\mathbf{y}$, with location parameter $\mathbf{X}\boldsymbol\mu_0$, degrees of freedom $2a_0$, and scale matrix $\frac{b_0}{a_0}\left(\mathbf{I} + \mathbf{X}\boldsymbol\Lambda_0^{-1}\mathbf{X}^{\mathsf{T}}\right)$, where $\boldsymbol\mu_0$, $\boldsymbol\Lambda_0^{-1}$, $b_0$, and $a_0$ are the prior mean, covariance scaling matrix, scale, and degrees-of-freedom parameters of the NIG prior, respectively. This closed form facilitates exact computation when the prior is conjugate, enabling analytical model assessment without simulation. It is derived by completing the square in the joint density of the parameters and data, marginalizing over $\boldsymbol\beta$ and $\sigma^2$.[29][11]

The marginal likelihood serves as the basis for Bayes factors, which are ratios of evidences between competing models and support Bayesian model selection. For instance, the Bayes factor comparing two nested linear regression models $m_1$ and $m_2$ is $BF_{12} = p(\mathbf{y} \mid \mathbf{X}, m_1)/p(\mathbf{y} \mid \mathbf{X}, m_2)$, where values greater than 1 favor $m_1$ and provide a measure of relative evidence strength. This approach is particularly useful in variable selection for linear regression, as it penalizes model complexity through prior integration.[30]

When closed-form expressions are unavailable or priors are non-conjugate, the marginal likelihood can be approximated using posterior samples from MCMC methods. The harmonic mean estimator, computed as the reciprocal of the average of the reciprocals of the likelihood evaluated at posterior draws, provides a simple Monte Carlo-based estimate but suffers from high variance. Chib's method offers a more stable alternative by rearranging Bayes' theorem so that the marginal likelihood equals the product of likelihood and prior divided by the posterior ordinate at a high-density point, leveraging Gibbs sampling output for accuracy. In high dimensions, such as large-scale linear regression with many predictors, these approximations face challenges due to poor MCMC mixing and the curse of dimensionality, often requiring advanced techniques like bridge sampling for reliable estimates.[31][32]
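In the conjugate case a Bayes factor is simply a ratio of two such evidence values. A brief sketch, reusing the log_evidence helper from the model-evidence section and assuming two candidate design matrices X1 and X2 with their own (hypothetical) priors have been defined:

import numpy as np

log_ev_1 = log_evidence(X1, y, mu0_1, Lambda0_1, a0, b0)   # evidence of model 1
log_ev_2 = log_evidence(X2, y, mu0_2, Lambda0_2, a0, b0)   # evidence of model 2
bf_12 = np.exp(log_ev_1 - log_ev_2)                        # BF_12 > 1 favours model 1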
Predictive Distributions
In Bayesian linear regression, the posterior predictive distribution provides a full probabilistic forecast for new observations $\tilde{y}$ given inputs $\tilde{\mathbf{x}}$, the observed data $\mathbf{y}$, and design matrix $\mathbf{X}$. It is obtained by integrating the likelihood over the posterior distribution of the parameters:
$$p(\tilde{y} \mid \tilde{\mathbf{x}}, \mathbf{y}, \mathbf{X}) = \int p(\tilde{y} \mid \tilde{\mathbf{x}}, \boldsymbol\beta, \sigma^2)\, p(\boldsymbol\beta, \sigma^2 \mid \mathbf{y}, \mathbf{X})\, d\boldsymbol\beta\, d\sigma^2.$$
This distribution quantifies both the aleatoric uncertainty from the noise process and the epistemic uncertainty from the parameters, enabling predictions that reflect all sources of variability in the model.

Under conjugate priors, such as the normal-inverse-gamma prior on $\boldsymbol\beta$ and $\sigma^2$, the posterior predictive distribution admits a closed-form expression as a Student-t distribution. Specifically, for a new observation,
$$\tilde{y} \mid \tilde{\mathbf{x}}, \mathbf{y}, \mathbf{X} \sim t_{2a_n}\left(\tilde{\mathbf{x}}^{\mathsf{T}}\boldsymbol\mu_n,\ \frac{b_n}{a_n}\left(1 + \tilde{\mathbf{x}}^{\mathsf{T}}\boldsymbol\Lambda_n^{-1}\tilde{\mathbf{x}}\right)\right),$$
where $a_n$, $\boldsymbol\mu_n$, $b_n$, and $\boldsymbol\Lambda_n$ are updated posterior quantities derived from the data and prior (degrees of freedom $2a_n$, posterior mean $\boldsymbol\mu_n$, scale $b_n$, and precision matrix $\boldsymbol\Lambda_n$). This form arises because the marginal posterior for $\boldsymbol\beta$ is Student-t when $\sigma^2$ has an inverse-gamma prior, and convolving with the Gaussian likelihood yields another Student-t. The heavier tails of the Student-t compared to a Gaussian capture the combined uncertainties, making it suitable for small samples or when the variance is uncertain.[33][29]

Credible intervals derived from the posterior predictive distribution are typically wider than corresponding frequentist confidence intervals for the same data, as they incorporate uncertainty in the parameter estimates $\boldsymbol\beta$ and $\sigma^2$ in addition to sampling variability. For instance, in linear models with partially observed outcomes, asymptotic analysis shows that Bayesian credible intervals expand to account for this full posterior variability, providing more conservative forecasts that avoid understating risk. This difference highlights the Bayesian approach's explicit quantification of parameter indeterminacy, which frequentist intervals treat as fixed.

When conjugate priors do not apply, the posterior predictive can be approximated via Monte Carlo methods by drawing samples $(\boldsymbol\beta^{(s)}, \sigma^{2(s)})$ from the posterior and then simulating $\tilde{y}^{(s)} \sim \mathcal{N}(\tilde{\mathbf{x}}^{\mathsf{T}}\boldsymbol\beta^{(s)}, \sigma^{2(s)})$ for each draw. The empirical distribution of the $\tilde{y}^{(s)}$ approximates $p(\tilde{y} \mid \tilde{\mathbf{x}}, \mathbf{y}, \mathbf{X})$, from which summaries like means, variances, or quantiles can be computed. This simulation-based approach is flexible and scales to complex models using Markov chain Monte Carlo. To evaluate the quality of these predictives, metrics such as calibration (assessing reliability, e.g., via probability integral transform histograms) and sharpness (measuring concentration, e.g., average interval width) are used, often in conjunction with proper scoring rules to ensure the approximations are well calibrated to observed outcomes.[29]
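Under the conjugate model, the closed-form predictive interval can be computed directly. A small SciPy sketch, with mu_n, Lambda_n, a_n, b_n as in the conjugate update sketched earlier:

import numpy as np
from scipy import stats

def predictive_interval(x_new, mu_n, Lambda_n, a_n, b_n, level=0.95):
    """Equal-tailed posterior predictive interval at a new input under the conjugate model."""
    loc = x_new @ mu_n
    scale = np.sqrt((b_n / a_n) * (1.0 + x_new @ np.linalg.solve(Lambda_n, x_new)))
    return stats.t.interval(level, df=2 * a_n, loc=loc, scale=scale)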
Extensions and Applications
Advanced Variants
Bayesian linear regression can be extended to hierarchical models, where priors are placed on hyperparameters to account for variability across groups or levels, such as in multilevel regression with random effects. In these models, regression coefficients are treated as random variables drawn from a higher-level distribution, enabling partial pooling of information across subgroups to improve estimates in settings with clustered data. This approach, foundational in works like Gelman and Hill (2007), facilitates modeling of varying intercepts or slopes while incorporating uncertainty in the group-level parameters.

In high-dimensional settings where the number of predictors exceeds the sample size, sparsity-inducing priors address variable-selection challenges. The Bayesian lasso, introduced by Park and Casella (2008), employs a Laplace prior on the coefficients, equivalent to a scale mixture of normals, which shrinks small coefficients toward zero while providing credible intervals for selection.[16] The horseshoe prior, developed by Carvalho et al. (2009), uses a global-local shrinkage mechanism with half-Cauchy distributions on the local scales, achieving aggressive shrinkage for irrelevant variables and minimal shrinkage for important ones, and is particularly effective for sparse signals.[34] Spike-and-slab priors, as formalized by Ishwaran and Rao (2005), combine a point mass (spike) at zero with a broader slab distribution, enabling exact variable selection through posterior inclusion probabilities, though they are computationally intensive without approximations.

Robust variants of Bayesian linear regression replace Gaussian errors with heavier-tailed distributions to mitigate outlier influence. The Student-t error model, with degrees of freedom $\nu$ as a parameter controlling tail heaviness, downweights anomalous observations while maintaining conjugacy in certain hierarchical setups, as analyzed in Bayesian frameworks by Geweke (1993).[35] This extension connects to broader generalized linear models but focuses on linear mean structures with robust likelihoods, enhancing inference stability for contaminated data.

Time-series extensions incorporate temporal dependence into Bayesian linear regression, such as autoregressive (AR) models in which the response depends on its lagged values. Bayesian AR(p) models treat the AR coefficients as random with priors ensuring stationarity, estimated via MCMC, as in the foundational approach by Chib and Greenberg (1994).[36] In the nonparametric limit, Gaussian processes emerge as infinite-dimensional Bayesian linear regression with a kernel-induced covariance, viewing the regression function as a draw from a GP prior updated by data, equivalent to kernelized Bayesian regression as detailed by Rasmussen and Williams (2006).

Recent advancements post-2020 emphasize scalability for large datasets, particularly through variational Bayes (VB) approximations in Bayesian linear regression. Mean-field VB for high-dimensional sparse regression, as in Ray and Szabo (2021), optimizes tractable posteriors under spike-and-slab priors, achieving faster inference than MCMC while preserving selection accuracy.[37]
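A robust Student-t error variant of the basic model can be written in a few lines of PyMC. The sketch below assumes NumPy arrays X and y and uses an exponential prior on the degrees of freedom as an illustrative choice:

import pymc as pm

with pm.Model() as robust_model:
    beta = pm.Normal("beta", mu=0.0, sigma=10.0, shape=X.shape[1])
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    nu = pm.Exponential("nu", lam=1 / 30)                 # degrees of freedom controlling tail heaviness
    mu = pm.math.dot(X, beta)
    pm.StudentT("y_obs", nu=nu, mu=mu, sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000)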
Practical Applications
Bayesian linear regression finds extensive use in machine learning, particularly for incorporating uncertainty quantification into predictive models, which aids decision-making in scenarios like active learning, where the model selects data points to query based on predictive variance, and in reinforcement learning for balancing exploration and exploitation through epistemic uncertainty estimates. For instance, in active learning tasks, the posterior predictive distribution allows for efficient sample selection by prioritizing regions of high uncertainty, improving model performance with fewer labeled examples. A notable connection arises in the limiting case where Bayesian linear regression employs a Gaussian process prior on the regression weights, effectively recovering Gaussian process regression, which extends the framework to non-linear mappings via kernel functions and provides scalable uncertainty estimates for complex data.[38]

In scientific domains, Bayesian linear regression supports robust inference under uncertainty. In epidemiology, it enables adjustment for confounders through informative priors on regression coefficients, facilitating causal inference in observational studies with sparse or noisy data, such as modeling disease incidence while accounting for socioeconomic variables. In finance, the approach is applied to risk modeling by placing priors on volatilities and correlations, allowing for probabilistic forecasts of asset returns and value-at-risk estimates that incorporate historical market beliefs to mitigate overfitting in volatile environments. Environmental science leverages Bayesian linear regression in spatial regression models to predict phenomena like pollutant levels, where spatial priors capture autocorrelation and enable interpolation across unsampled locations, enhancing accuracy in geospatial datasets.[39][40][41]

Practical case studies illustrate its versatility. In A/B testing, Bayesian linear regression with conjugate priors stabilizes estimates for small sample sizes by shrinking coefficients toward prior means, enabling reliable comparison of treatment effects in online experiments with limited traffic, such as website variant evaluations where traditional methods suffer from high variance. For drug dose-response modeling, it fits hierarchical linear models to quantify efficacy curves across patient subgroups, using priors informed by preclinical data to predict optimal dosages while quantifying uncertainty in therapeutic windows, as demonstrated in meta-analyses of clinical trial outcomes.[42][43]

Implementation is facilitated by accessible software packages. In R, the brms package provides a user-friendly interface for fitting Bayesian linear regression models via Stan, making it straightforward to attach normal priors to the coefficients of simple Gaussian models. For example, a Bayesian linear regression with normal priors can be specified as follows:

library(brms)
# Gaussian linear model with normal(0, 10) priors on the intercept and the slope
fit <- brm(y ~ x, data = dataset, family = gaussian(),
           prior = prior(normal(0, 10), class = Intercept) +
                   prior(normal(0, 10), class = b),
           control = list(adapt_delta = 0.95))
summary(fit)
A comparable model can be fitted in Python with PyMC; the sketch below assumes NumPy arrays x and y hold the predictor and the response:

import arviz as az
import numpy as np
import pymc as pm

with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sigma=10)            # intercept prior
    beta = pm.Normal('beta', mu=0, sigma=10)              # slope prior
    sigma = pm.HalfNormal('sigma', sigma=1)               # noise scale prior
    mu = alpha + beta * x                                 # linear predictor
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
    trace = pm.sample(1000)                               # sample the posterior
az.summary(trace)
