Bayesian multivariate linear regression

from Wikipedia

In statistics, Bayesian multivariate linear regression is a Bayesian approach to multivariate linear regression, i.e. linear regression where the predicted outcome is a vector of correlated random variables rather than a single scalar random variable. A more general treatment of this approach can be found in the article MMSE estimator.

Details


Consider a regression problem where the dependent variable to be predicted is not a single real-valued scalar but an m-length vector of correlated real numbers. As in the standard regression setup, there are n observations, where each observation i consists of k−1 explanatory variables, grouped into a vector \mathbf{x}_i of length k (where a dummy variable with a value of 1 has been added to allow for an intercept coefficient). This can be viewed as a set of m related regression problems for each observation i:

y_{i,1} = \mathbf{x}_i^\top \boldsymbol\beta_1 + \epsilon_{i,1}
\quad\vdots
y_{i,m} = \mathbf{x}_i^\top \boldsymbol\beta_m + \epsilon_{i,m}

where the set of errors \{\epsilon_{i,1}, \ldots, \epsilon_{i,m}\} are all correlated. Equivalently, it can be viewed as a single regression problem where the outcome is a row vector \mathbf{y}_i^\top and the regression coefficient vectors are stacked next to each other, as follows:

\mathbf{y}_i^\top = \mathbf{x}_i^\top \mathbf{B} + \boldsymbol\epsilon_i^\top.

The coefficient matrix \mathbf{B} is a k \times m matrix where the coefficient vectors \boldsymbol\beta_1, \ldots, \boldsymbol\beta_m for each regression problem are stacked horizontally:

\mathbf{B} = \begin{bmatrix} \boldsymbol\beta_1 & \cdots & \boldsymbol\beta_m \end{bmatrix}.

The noise vector \boldsymbol\epsilon_i for each observation i is jointly normal, so that the outcomes for a given observation are correlated:

\boldsymbol\epsilon_i \sim N(\mathbf{0}, \boldsymbol\Sigma_\epsilon).

We can write the entire regression problem in matrix form as

\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E},

where \mathbf{Y} and \mathbf{E} are n \times m matrices. The design matrix \mathbf{X} is an n \times k matrix with the observations stacked vertically, as in the standard linear regression setup:

\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^\top \\ \vdots \\ \mathbf{x}_n^\top \end{bmatrix}.

The classical, frequentist linear least squares solution is to simply estimate the matrix of regression coefficients \hat{\mathbf{B}} using the Moore–Penrose pseudoinverse:

\hat{\mathbf{B}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}.
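As an illustration (not part of the original article), the following minimal NumPy sketch simulates data from this model and computes the least squares estimate; the variable names and dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 50, 3, 2                               # observations, predictors (incl. intercept), responses
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # design matrix with intercept column
B_true = rng.normal(size=(k, m))                 # coefficient matrix (k x m)
E = rng.normal(scale=0.5, size=(n, m))           # errors (uncorrelated here, for simplicity)
Y = X @ B_true + E

# Classical least squares estimate B_hat = (X'X)^{-1} X'Y via the Moore-Penrose pseudoinverse
B_hat = np.linalg.pinv(X) @ Y                    # equivalently: np.linalg.solve(X.T @ X, X.T @ Y)
```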

To obtain the Bayesian solution, we need to specify the conditional likelihood and then find the appropriate conjugate prior. As with the univariate case of linear Bayesian regression, we will find that we can specify a natural conditional conjugate prior (which is scale dependent).

Let us write our conditional likelihood as[1]

\rho(\mathbf{E} \mid \boldsymbol\Sigma_\epsilon) \propto |\boldsymbol\Sigma_\epsilon|^{-n/2} \exp\left(-\tfrac{1}{2} \operatorname{tr}\left(\mathbf{E}^\top \mathbf{E} \boldsymbol\Sigma_\epsilon^{-1}\right)\right);

writing the error \mathbf{E} in terms of \mathbf{Y}, \mathbf{X}, and \mathbf{B} yields

\rho(\mathbf{Y} \mid \mathbf{X}, \mathbf{B}, \boldsymbol\Sigma_\epsilon) \propto |\boldsymbol\Sigma_\epsilon|^{-n/2} \exp\left(-\tfrac{1}{2} \operatorname{tr}\left((\mathbf{Y} - \mathbf{X}\mathbf{B})^\top (\mathbf{Y} - \mathbf{X}\mathbf{B}) \boldsymbol\Sigma_\epsilon^{-1}\right)\right).

We seek a natural conjugate prior—a joint density \rho(\mathbf{B}, \boldsymbol\Sigma_\epsilon) which is of the same functional form as the likelihood. Since the likelihood is quadratic in \mathbf{B}, we re-write the likelihood so it is normal in (\mathbf{B} - \hat{\mathbf{B}}) (the deviation from the classical sample estimate).

Using the same technique as with Bayesian linear regression, we decompose the exponential term using a matrix-form of the sum-of-squares technique. Here, however, we will also need to use the Matrix Differential Calculus (Kronecker product and vectorization transformations).

First, let us apply sum-of-squares to obtain a new expression for the likelihood:

\rho(\mathbf{Y} \mid \mathbf{X}, \mathbf{B}, \boldsymbol\Sigma_\epsilon) \propto |\boldsymbol\Sigma_\epsilon|^{-(n-k)/2} \exp\left(-\operatorname{tr}\left(\tfrac{1}{2}\mathbf{S}^\top \mathbf{S} \boldsymbol\Sigma_\epsilon^{-1}\right)\right)\, |\boldsymbol\Sigma_\epsilon|^{-k/2} \exp\left(-\tfrac{1}{2} \operatorname{tr}\left((\mathbf{B} - \hat{\mathbf{B}})^\top \mathbf{X}^\top\mathbf{X} (\mathbf{B} - \hat{\mathbf{B}}) \boldsymbol\Sigma_\epsilon^{-1}\right)\right),

where \mathbf{S} = \mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}.

We would like to develop a conditional form for the priors,

\rho(\mathbf{B}, \boldsymbol\Sigma_\epsilon) = \rho(\boldsymbol\Sigma_\epsilon)\, \rho(\mathbf{B} \mid \boldsymbol\Sigma_\epsilon),

where \rho(\boldsymbol\Sigma_\epsilon) is an inverse-Wishart distribution and \rho(\mathbf{B} \mid \boldsymbol\Sigma_\epsilon) is some form of normal distribution in the matrix \mathbf{B}. This is accomplished using the vectorization transformation, which converts the likelihood from a function of the matrices \mathbf{B}, \hat{\mathbf{B}} to a function of the vectors \boldsymbol\beta = \operatorname{vec}(\mathbf{B}), \hat{\boldsymbol\beta} = \operatorname{vec}(\hat{\mathbf{B}}).

Write

\operatorname{tr}\left((\mathbf{B} - \hat{\mathbf{B}})^\top \mathbf{X}^\top\mathbf{X} (\mathbf{B} - \hat{\mathbf{B}}) \boldsymbol\Sigma_\epsilon^{-1}\right) = \operatorname{vec}(\mathbf{B} - \hat{\mathbf{B}})^\top \operatorname{vec}\left(\mathbf{X}^\top\mathbf{X} (\mathbf{B} - \hat{\mathbf{B}}) \boldsymbol\Sigma_\epsilon^{-1}\right).

Let

\operatorname{vec}\left(\mathbf{X}^\top\mathbf{X} (\mathbf{B} - \hat{\mathbf{B}}) \boldsymbol\Sigma_\epsilon^{-1}\right) = (\boldsymbol\Sigma_\epsilon^{-1} \otimes \mathbf{X}^\top\mathbf{X}) \operatorname{vec}(\mathbf{B} - \hat{\mathbf{B}}),

where \mathbf{A} \otimes \mathbf{B} denotes the Kronecker product of matrices A and B, a generalization of the outer product which produces a block matrix consisting of every combination of products of elements from the two matrices.

Then

\operatorname{vec}(\mathbf{B} - \hat{\mathbf{B}})^\top (\boldsymbol\Sigma_\epsilon^{-1} \otimes \mathbf{X}^\top\mathbf{X}) \operatorname{vec}(\mathbf{B} - \hat{\mathbf{B}}) = (\boldsymbol\beta - \hat{\boldsymbol\beta})^\top (\boldsymbol\Sigma_\epsilon^{-1} \otimes \mathbf{X}^\top\mathbf{X}) (\boldsymbol\beta - \hat{\boldsymbol\beta}),

which will lead to a likelihood which is normal in (\boldsymbol\beta - \hat{\boldsymbol\beta}).
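The vectorization and Kronecker-product step above can be checked numerically. The sketch below (with arbitrary test matrices, and vec taken as column-stacking, NumPy's order='F') verifies the identity vec(ADC) = (C^T ⊗ A) vec(D) as it is used here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, m = 20, 3, 2
X = rng.normal(size=(n, k))
D = rng.normal(size=(k, m))                      # stands in for B - B_hat
A = rng.normal(size=(m, m))
Sigma_inv = A @ A.T + m * np.eye(m)              # a symmetric positive definite Sigma_eps^{-1}

vec = lambda M: M.reshape(-1, order="F")         # column-stacking vectorization

lhs = vec(X.T @ X @ D @ Sigma_inv)
rhs = np.kron(Sigma_inv, X.T @ X) @ vec(D)
assert np.allclose(lhs, rhs)                     # vec(A D C) = (C' (x) A) vec(D), with C = Sigma_inv symmetric

# The trace form of the likelihood exponent equals the quadratic form in vec(B - B_hat)
assert np.isclose(np.trace(D.T @ X.T @ X @ D @ Sigma_inv), vec(D) @ rhs)
```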

With the likelihood in a more tractable form, we can now find a natural (conditional) conjugate prior.

Conjugate prior distribution


The natural conjugate prior using the vectorized variable \boldsymbol\beta = \operatorname{vec}(\mathbf{B}) is of the form:[1]

\rho(\boldsymbol\beta, \boldsymbol\Sigma_\epsilon) = \rho(\boldsymbol\Sigma_\epsilon)\, \rho(\boldsymbol\beta \mid \boldsymbol\Sigma_\epsilon),

where

\rho(\boldsymbol\Sigma_\epsilon) \sim \mathcal{IW}(\mathbf{V}_0, \boldsymbol\nu_0)

and

\rho(\boldsymbol\beta \mid \boldsymbol\Sigma_\epsilon) \sim \mathcal{N}(\boldsymbol\beta_0, \boldsymbol\Sigma_\epsilon \otimes \boldsymbol\Lambda_0^{-1}).

Posterior distribution


Using the above prior and likelihood, the posterior distribution can be expressed as:[1]

\rho(\mathbf{B}, \boldsymbol\Sigma_\epsilon \mid \mathbf{Y}, \mathbf{X}) \propto |\boldsymbol\Sigma_\epsilon|^{-(\boldsymbol\nu_0 + m + 1)/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left(\mathbf{V}_0 \boldsymbol\Sigma_\epsilon^{-1}\right)\right) \times |\boldsymbol\Sigma_\epsilon|^{-k/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{B} - \mathbf{B}_0)^\top \boldsymbol\Lambda_0 (\mathbf{B} - \mathbf{B}_0) \boldsymbol\Sigma_\epsilon^{-1}\right)\right) \times |\boldsymbol\Sigma_\epsilon|^{-n/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{Y} - \mathbf{X}\mathbf{B})^\top (\mathbf{Y} - \mathbf{X}\mathbf{B}) \boldsymbol\Sigma_\epsilon^{-1}\right)\right),

where \operatorname{vec}(\mathbf{B}_0) = \boldsymbol\beta_0. The terms involving \mathbf{B} can be grouped (with \boldsymbol\Lambda_0 = \mathbf{U}^\top\mathbf{U}) using:

(\mathbf{B} - \mathbf{B}_0)^\top \boldsymbol\Lambda_0 (\mathbf{B} - \mathbf{B}_0) + (\mathbf{Y} - \mathbf{X}\mathbf{B})^\top (\mathbf{Y} - \mathbf{X}\mathbf{B}) = \left(\begin{bmatrix}\mathbf{Y}\\ \mathbf{U}\mathbf{B}_0\end{bmatrix} - \begin{bmatrix}\mathbf{X}\\ \mathbf{U}\end{bmatrix}\mathbf{B}\right)^\top \left(\begin{bmatrix}\mathbf{Y}\\ \mathbf{U}\mathbf{B}_0\end{bmatrix} - \begin{bmatrix}\mathbf{X}\\ \mathbf{U}\end{bmatrix}\mathbf{B}\right),

with

\mathbf{B}_n = \left(\mathbf{X}^\top\mathbf{X} + \boldsymbol\Lambda_0\right)^{-1}\left(\mathbf{X}^\top\mathbf{Y} + \boldsymbol\Lambda_0\mathbf{B}_0\right).

This now allows us to write the posterior in a more useful form:

\rho(\mathbf{B}, \boldsymbol\Sigma_\epsilon \mid \mathbf{Y}, \mathbf{X}) \propto |\boldsymbol\Sigma_\epsilon|^{-(\boldsymbol\nu_0 + m + n + 1)/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left(\left(\mathbf{V}_0 + (\mathbf{Y} - \mathbf{X}\mathbf{B}_n)^\top(\mathbf{Y} - \mathbf{X}\mathbf{B}_n) + (\mathbf{B}_n - \mathbf{B}_0)^\top\boldsymbol\Lambda_0(\mathbf{B}_n - \mathbf{B}_0)\right)\boldsymbol\Sigma_\epsilon^{-1}\right)\right) \times |\boldsymbol\Sigma_\epsilon|^{-k/2}\exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{B} - \mathbf{B}_n)^\top(\mathbf{X}^\top\mathbf{X} + \boldsymbol\Lambda_0)(\mathbf{B} - \mathbf{B}_n)\boldsymbol\Sigma_\epsilon^{-1}\right)\right).

This takes the form of an inverse-Wishart distribution times a matrix normal distribution:

\rho(\boldsymbol\Sigma_\epsilon \mid \mathbf{Y}, \mathbf{X}) \sim \mathcal{IW}(\mathbf{V}_n, \boldsymbol\nu_n)

and

\rho(\mathbf{B} \mid \mathbf{Y}, \mathbf{X}, \boldsymbol\Sigma_\epsilon) \sim \mathcal{MN}(\mathbf{B}_n, \boldsymbol\Lambda_n^{-1}, \boldsymbol\Sigma_\epsilon).

The parameters of this posterior are given by:

\mathbf{V}_n = \mathbf{V}_0 + (\mathbf{Y} - \mathbf{X}\mathbf{B}_n)^\top(\mathbf{Y} - \mathbf{X}\mathbf{B}_n) + (\mathbf{B}_n - \mathbf{B}_0)^\top\boldsymbol\Lambda_0(\mathbf{B}_n - \mathbf{B}_0), \qquad \boldsymbol\nu_n = \boldsymbol\nu_0 + n,

\mathbf{B}_n = (\mathbf{X}^\top\mathbf{X} + \boldsymbol\Lambda_0)^{-1}(\mathbf{X}^\top\mathbf{Y} + \boldsymbol\Lambda_0\mathbf{B}_0), \qquad \boldsymbol\Lambda_n = \mathbf{X}^\top\mathbf{X} + \boldsymbol\Lambda_0.
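A minimal NumPy sketch of these update formulas, continuing the notation above (the function name and whatever hyperparameter values a caller would pass are illustrative, not prescribed):

```python
import numpy as np

def posterior_params(Y, X, B0, Lambda0, V0, nu0):
    """Conjugate posterior hyperparameters (B_n, Lambda_n, V_n, nu_n) in the notation above."""
    n = Y.shape[0]
    Lambda_n = X.T @ X + Lambda0
    B_n = np.linalg.solve(Lambda_n, X.T @ Y + Lambda0 @ B0)
    resid = Y - X @ B_n
    V_n = V0 + resid.T @ resid + (B_n - B0).T @ Lambda0 @ (B_n - B0)
    nu_n = nu0 + n
    return B_n, Lambda_n, V_n, nu_n
```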

from Grokipedia
Bayesian multivariate linear regression is a Bayesian statistical framework that generalizes the classical multivariate linear regression model to incorporate prior knowledge about parameters, enabling probabilistic inference for multiple correlated response variables based on a set of predictors. It models the relationship between an n × p response matrix Y and an n × q design matrix X through Y = XB + E, where B is the q × p coefficient matrix and E is the error matrix with rows independently distributed as multivariate normal with zero mean and covariance matrix Σ. The likelihood is derived from the matrix normal distribution, given by

f(\mathbf{Y} \mid \mathbf{B}, \Sigma) = (2\pi)^{-np/2} |\Sigma|^{-n/2} \exp\left[ -\frac{1}{2} \operatorname{tr}\left( (\mathbf{Y} - \mathbf{XB})^\top (\mathbf{Y} - \mathbf{XB}) \Sigma^{-1} \right) \right].

In the Bayesian approach, conjugate priors are typically specified for the parameters to facilitate closed-form posterior inference: a matrix normal prior for B conditional on Σ, such as B | Σ ~ MatrixNormal(B_0, Σ, Ω_0), and an inverse-Wishart prior for the covariance matrix, Σ ~ InverseWishart(ν_0, S_0). This setup, originally developed by Tiao and Zellner in 1964, yields a posterior for B | Σ, data that is also matrix normal and a marginal posterior for Σ that follows an updated inverse-Wishart distribution, allowing for exact inference without simulation in the conjugate case. The resulting posterior distributions provide full uncertainty quantification for predictions and parameter estimates, shrinking coefficients toward prior means and accounting for the uncertainty in Σ. Compared to univariate Bayesian linear regression applied separately to each response, the multivariate version explicitly models the correlations among responses via Σ, leading to improved estimation of means and modest gains in accuracy, particularly when responses are highly correlated. It also supports variable selection through priors like spike-and-slab extensions and handles outliers or heavy-tailed errors via robust generalizations. Applications include change-point detection in correlated variables, genomics for associating genetic variants with multiple phenotypes, and environmental modeling where multiple responses exhibit dependencies. Modern implementations often use simulation-based methods such as Markov chain Monte Carlo for non-conjugate priors or complex extensions, enhancing flexibility in high-dimensional settings.

Introduction

Definition and Scope

Bayesian multivariate linear regression is a probabilistic framework for modeling the relationships between multiple predictor variables and several response variables simultaneously, treating the model parameters as random variables whose distributions are updated via Bayes' theorem based on observed data. In this approach, an n \times q matrix of response variables \mathbf{Y}, where n denotes the number of observations and q > 1 the number of responses, is expressed as \mathbf{Y} = \mathbf{XB} + \mathbf{E}, with \mathbf{X} an n \times p matrix of predictors, \mathbf{B} a p \times q matrix of regression coefficients, and \mathbf{E} an error matrix whose rows are independently distributed as multivariate normal with mean vector zero and q \times q covariance matrix \Sigma. Priors are specified for \mathbf{B} and \Sigma, enabling the derivation of posterior distributions that incorporate both prior beliefs and data evidence. The scope of Bayesian multivariate linear regression encompasses scenarios involving multiple correlated responses, where joint modeling captures interdependencies among outcomes that univariate approaches cannot address. Unlike scalar-response models, it allows for shared predictors across responses while accounting for the full covariance structure in \Sigma, facilitating analyses like simultaneous equation systems. This distinguishes it from classical multivariate regression, which relies on point estimates from least squares. A primary motivation for adopting the Bayesian approach is to overcome limitations of classical methods, which provide only point estimates and require additional approximations for uncertainty quantification, by instead yielding full posterior distributions for parameters and predictions. This enables rigorous uncertainty quantification in joint inferences across responses and incorporates prior knowledge to induce shrinkage, reducing variability in coefficient estimates particularly in high-dimensional or small-sample settings. Such features enhance predictive accuracy and interpretability in multivariate contexts.

Relation to Classical and Univariate Bayesian Regression

Bayesian multivariate linear regression extends the classical approach to multivariate multiple regression, where the latter relies on ordinary least squares (OLS) to derive point estimates for the regression coefficient matrix and for the residual covariance matrix, yielding deterministic predictions without inherent uncertainty quantification. In contrast, the Bayesian framework incorporates prior distributions on the parameters, resulting in full posterior distributions that facilitate regularization through shrinkage toward prior means, credible intervals for parameters and predictions, and explicit incorporation of uncertainty in multivariate responses. This allows for more robust inference in scenarios with limited data or collinearity among predictors, where classical OLS can produce unstable estimates. The model generalizes univariate Bayesian linear regression, which assumes a scalar response (corresponding to the case of one response variable, q = 1) and a scalar variance instead of the full covariance matrix Σ. In the multivariate setting, it accommodates correlated responses through the matrix-valued coefficient B and the positive definite covariance matrix Σ, enabling joint modeling of multiple outcomes while allowing priors that can be specified conditionally on rows or columns of B to capture dependencies across responses or predictors. This extension preserves the structure of the univariate case but scales it to matrix forms, such as the matrix normal prior for B conditional on Σ, providing a unified framework for handling vector-valued observations. Bayesian multivariate linear regression emerged in the 1960s as an extension of univariate Bayesian methods, with foundational contributions including George C. Tiao and Arnold Zellner's 1964 work on Bayesian estimation for systems of correlated regression equations and the comprehensive treatment in Box and Tiao's 1973 book on Bayesian inference in statistical analysis, which built upon Lindley's advocacy for Bayesian approaches to linear models.

Model Specification

Likelihood Function

In Bayesian multivariate linear regression, the likelihood function describes the probability of the observed response matrix \mathbf{Y} (an n \times p matrix, where n is the number of observations and p is the number of response variables) given the design matrix \mathbf{X} (an n \times q matrix, with q predictors), the coefficient matrix \mathbf{B} (a q \times p matrix), and the p \times p positive definite covariance matrix \boldsymbol{\Sigma}. The model assumes that the errors \mathbf{E} = \mathbf{Y} - \mathbf{XB} have rows that are independently distributed as multivariate normal, \mathbf{e}_i \sim \mathcal{MVN}(\mathbf{0}, \boldsymbol{\Sigma}) for i = 1, \dots, n, where \boldsymbol{\Sigma} is shared across all observations and captures the covariance structure among the p response variables. This independence across rows implies that the joint likelihood factors as a product over individual observations:

L(\mathbf{Y} \mid \mathbf{X}, \mathbf{B}, \boldsymbol{\Sigma}) = \prod_{i=1}^n (2\pi)^{-p/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\left\{ -\frac{1}{2} (\mathbf{y}_i - \mathbf{B}^\top \mathbf{x}_i)^\top \boldsymbol{\Sigma}^{-1} (\mathbf{y}_i - \mathbf{B}^\top \mathbf{x}_i) \right\},

where \mathbf{y}_i^\top and \mathbf{x}_i^\top denote the i-th rows of \mathbf{Y} and \mathbf{X}, respectively. Equivalently, the entire response matrix \mathbf{Y} follows a matrix normal distribution conditional on the parameters:

\mathbf{Y} \mid \mathbf{B}, \boldsymbol{\Sigma}, \mathbf{X} \sim \text{MatrixNormal}(\mathbf{XB}, \mathbf{I}_n, \boldsymbol{\Sigma}),

which yields the compact matrix-form likelihood:

L(\mathbf{Y} \mid \mathbf{X}, \mathbf{B}, \boldsymbol{\Sigma}) = (2\pi)^{-np/2} |\boldsymbol{\Sigma}|^{-n/2} \exp\left\{ -\frac{1}{2} \operatorname{tr}\left[ (\mathbf{Y} - \mathbf{XB})^\top (\mathbf{Y} - \mathbf{XB}) \boldsymbol{\Sigma}^{-1} \right] \right\}.

This formulation arises from the row-wise independence and multivariate normality of the residuals. The multivariate normal structure of the likelihood in the residuals facilitates analytical tractability, particularly in conjugate Bayesian settings where it enables closed-form posterior updates for the parameters when paired with appropriate priors.
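As a numerical check, the sketch below (arbitrary simulated values; SciPy's multivariate normal density) confirms that the row-wise product and the compact trace form give the same log-likelihood.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
n, q, p = 30, 3, 2
X = rng.normal(size=(n, q))
B = rng.normal(size=(q, p))
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
Y = X @ B + rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Row-wise form: sum of multivariate normal log-densities
rowwise = sum(multivariate_normal.logpdf(Y[i], mean=X[i] @ B, cov=Sigma) for i in range(n))

# Compact matrix (trace) form of the same log-likelihood
R = Y - X @ B
matrix_form = (-0.5 * n * p * np.log(2 * np.pi)
               - 0.5 * n * np.log(np.linalg.det(Sigma))
               - 0.5 * np.trace(R.T @ R @ np.linalg.inv(Sigma)))
assert np.isclose(rowwise, matrix_form)
```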

Parameterization of Parameters

In Bayesian multivariate linear regression, the core parameters consist of the regression coefficient matrix \mathbf{B} and the residual covariance matrix \mathbf{\Sigma}. The matrix \mathbf{B} is of dimension q \times p, where q denotes the number of predictors and p the number of response variables. Each column of \mathbf{B} contains the coefficients associated with a single response variable, while each row corresponds to the effects of a specific predictor across all responses. This structure allows \mathbf{B} to capture how individual predictors influence multiple outcomes simultaneously, enabling the interpretation of shared or differential impacts on the responses. The covariance matrix \mathbf{\Sigma} is a p \times p positive definite matrix that models the joint variability and dependence among the response variables. Its diagonal elements represent the marginal variances of each response, quantifying the uncertainty or spread in predictions for individual outcomes after accounting for predictors. The off-diagonal elements capture the pairwise covariances, reflecting correlations between responses—positive values indicate that deviations in one response tend to align with those in another, while negative values suggest opposing movements. This parameterization facilitates the analysis of interdependencies, such as in economic models where multiple indicators (e.g., GDP growth and inflation) may covary. Additional parameters may include intercepts, which are incorporated by augmenting the predictor matrix \mathbf{X} with a column of ones, thereby embedding the intercept terms as the first row of \mathbf{B} for each response. In extensions to handle heteroscedasticity, row- or column-specific variances can be introduced by allowing \mathbf{\Sigma} to vary across observations or groups, though this increases model complexity.
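For instance, intercepts enter by prepending a column of ones to the predictor matrix, so that the first row of B holds the intercepts; a small sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, q, p = 40, 2, 3                               # observations, predictors (before intercept), responses
X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])  # first column of ones carries the intercepts

B = rng.normal(size=(q + 1, p))                  # (q+1) x p coefficient matrix
intercepts = B[0]                                # first row: one intercept per response
effects_of_predictor_1 = B[1]                    # a row: effect of one predictor on all p responses
coeffs_for_response_1 = B[:, 0]                  # a column: all coefficients for a single response
```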

Prior Distributions

Conjugate Priors

In Bayesian multivariate linear regression, the standard conjugate prior setup specifies a joint distribution for the regression coefficient matrix \mathbf{B} (of dimension q \times p, where q is the number of predictors and p the number of response variables) and the response covariance matrix \boldsymbol{\Sigma} (of dimension p \times p) that facilitates closed-form posterior inference. Conditional on \boldsymbol{\Sigma}, the prior for \mathbf{B} follows a matrix normal distribution:

\mathbf{B} \mid \boldsymbol{\Sigma} \sim \mathcal{MN}_{q \times p}(\mathbf{M}_0, \mathbf{V}_0^{-1}, \boldsymbol{\Sigma}),

where \mathbf{M}_0 is the q \times p prior mean matrix, and \mathbf{V}_0 is the q \times q prior precision matrix capturing beliefs about the predictor directions. Equivalently, this induces a multivariate normal distribution for the vectorized form:

\text{vec}(\mathbf{B}) \mid \boldsymbol{\Sigma} \sim \mathcal{N}_{qp}(\text{vec}(\mathbf{M}_0), \boldsymbol{\Sigma} \otimes \mathbf{V}_0^{-1}).

The marginal prior for \boldsymbol{\Sigma} is an inverse-Wishart distribution:

\boldsymbol{\Sigma} \sim \mathcal{IW}_p(\boldsymbol{\Psi}_0, \nu_0),

with scale matrix \boldsymbol{\Psi}_0 (p \times p, positive definite) and \nu_0 > p - 1 to ensure a proper prior. This matrix-normal-inverse-Wishart prior is conjugate because the likelihood, which arises from a matrix normal sampling distribution for the data, shares the same exponential family form, yielding a posterior that retains the matrix-normal-inverse-Wishart structure with updated hyperparameters derived from the data. The matrix-normal component induces multivariate normal priors for the rows of \mathbf{B} conditional on \boldsymbol{\Sigma}, allowing correlations across predictors via \mathbf{V}_0^{-1} while modeling dependence among responses through \boldsymbol{\Sigma}. If \mathbf{V}_0^{-1} is diagonal, the rows are independent. The hyperparameters encode prior knowledge: \mathbf{M}_0 represents the expected regression coefficients, \mathbf{V}_0 controls the precision (inverse variance) of the coefficients across predictors (with larger values indicating stronger belief in \mathbf{M}_0), \boldsymbol{\Psi}_0 scales the expected covariance among responses, and \nu_0 influences the prior concentration around \boldsymbol{\Psi}_0 / (\nu_0 - p - 1). Noninformative limits can be obtained by letting \nu_0 \to p + 1 and scaling \boldsymbol{\Psi}_0 appropriately, though care is needed to maintain propriety.
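As an illustration, one can simulate from this joint prior with SciPy's invwishart and matrix_normal; the hyperparameter values below are arbitrary, and since V_0 is a precision matrix the row covariance passed to matrix_normal is its inverse.

```python
import numpy as np
from scipy.stats import invwishart, matrix_normal

rng = np.random.default_rng(4)
q, p = 3, 2
M0 = np.zeros((q, p))                            # prior mean of B
V0 = np.eye(q)                                   # prior row precision; the row covariance is V0^{-1}
Psi0, nu0 = np.eye(p), p + 2                     # inverse-Wishart scale and degrees of freedom

Sigma = invwishart.rvs(df=nu0, scale=Psi0, random_state=rng)      # Sigma ~ IW_p(Psi0, nu0)
B = matrix_normal.rvs(mean=M0, rowcov=np.linalg.inv(V0),          # B | Sigma ~ MN(M0, V0^{-1}, Sigma)
                      colcov=Sigma, random_state=rng)
```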

Flexible and Non-Conjugate Priors

In Bayesian multivariate linear regression, flexible priors extend beyond the restrictive conjugate forms, such as the matrix normal-inverse Wishart, by allowing more adaptable specifications that incorporate prior knowledge about sparsity, shrinkage, or structural constraints on the regression coefficient matrix B and covariance matrix \Sigma. These priors often lead to non-conjugate posteriors, necessitating numerical methods, but they enable modeling complex dependencies and high-dimensional scenarios where conjugate priors may impose unrealistic assumptions. One common flexible prior on the coefficients places independent normal distributions on the elements of \mathrm{vec}(B), with a diagonal covariance that induces ridge-like shrinkage across predictors and responses. This specification promotes uniform regularization while remaining computationally tractable in moderate dimensions, differing from conjugate priors by allowing tunable shrinkage levels via the prior variance. For sparsity induction, hierarchical priors such as the horseshoe can be applied to the elements of B, where local and global shrinkage parameters adaptively shrink small coefficients toward zero while preserving large ones, effectively performing variable selection in multivariate settings. Multivariate extensions of the horseshoe, including joint priors on B and the inverse covariance matrix, have shown robust performance in high-dimensional genomic data by adapting to both row and column sparsity patterns in B. Matrix-variate priors offer further flexibility for structured data; for instance, the matrix-t prior on B provides heavier tails than the matrix normal, enhancing robustness to outliers in the coefficients by modeling scale mixtures of matrix normals. Similarly, the matrix Bingham prior is suitable for directional or reduced-rank data, as it concentrates probability on orthogonal subspaces, facilitating priors on the column space of B in envelope models that reduce dimensionality while preserving multivariate structure. These priors address limitations of conjugate setups by incorporating geometric constraints relevant to applications like shape analysis or factor models. For the covariance matrix \Sigma, non-conjugate priors enable nonparametric modeling of unknown response correlations, allowing the prior to adapt to clustering or low-rank structures in the covariance without assuming a fixed parametric form such as the inverse Wishart. This is particularly useful in scenarios with evolving or heterogeneous covariance across observations. Empirical Bayes approaches further enhance flexibility by data-driven selection of hyperparameters for these priors, estimating shrinkage or mixing weights from marginal likelihoods to balance sparsity and density in high-dimensional multivariate regressions, outperforming fixed conjugate choices in prediction accuracy in applications such as genetics. The advantages of these flexible priors include the ability to incorporate domain-specific shrinkage across multiple responses, and robustness to outliers or high dimensionality, where conjugate priors often fail due to their rigidity and sensitivity to prior misspecification. By enabling tailored regularization, they improve interpretability and predictive performance in complex datasets, though at the cost of analytical tractability.
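To make the horseshoe idea concrete, the sketch below draws a coefficient matrix from a simple horseshoe prior (half-Cauchy local and global scales); the dimensions and scale choices are arbitrary illustrations, not a recommended specification.

```python
import numpy as np

rng = np.random.default_rng(5)
q, p = 10, 4                                      # predictors, responses
tau = np.abs(rng.standard_cauchy())               # global shrinkage scale ~ half-Cauchy(0, 1)
lam = np.abs(rng.standard_cauchy(size=(q, p)))    # local scales, one per coefficient
B = rng.normal(loc=0.0, scale=tau * lam)          # B_jk | lam_jk, tau ~ N(0, (tau * lam_jk)^2)
# Most entries of B are pulled toward zero, while entries with large local scales escape shrinkage.
```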

Posterior Distribution

Derivation in the Conjugate Case

In the conjugate case, the prior distribution for the regression coefficient matrix B (of dimension q \times p) conditional on the covariance matrix \Sigma (of dimension p \times p) is a matrix normal distribution, B \mid \Sigma \sim \mathrm{MN}(M_0, V_0, \Sigma), where M_0 is the prior mean matrix and V_0 is the prior row covariance matrix (of dimension q \times q); the marginal prior for \Sigma is inverse Wishart, \Sigma \sim \mathrm{IW}(\nu_0, \Psi_0), with \nu_0 > p - 1 and scale matrix \Psi_0. The joint prior density is thus p(B, \Sigma) \propto p(B \mid \Sigma)\, p(\Sigma), which takes the form

p(B, \Sigma) \propto |\Sigma|^{-q/2} \exp\left( -\frac{1}{2} \operatorname{tr}\left[ \Sigma^{-1} (B - M_0)^\top V_0^{-1} (B - M_0) \right] \right) \times |\Sigma|^{-(\nu_0 + p + 1)/2} \exp\left( -\frac{1}{2} \operatorname{tr}\left[ \Sigma^{-1} \Psi_0 \right] \right).

The likelihood for the response matrix Y (of dimension n \times p) given the design matrix X (of dimension n \times q), B, and \Sigma assumes independent rows, yielding a matrix normal distribution Y \mid B, \Sigma, X \sim \mathrm{MN}(X B, I_n, \Sigma), with density

p(Y \mid B, \Sigma, X) \propto |\Sigma|^{-n/2} \exp\left( -\frac{1}{2} \operatorname{tr}\left[ \Sigma^{-1} (Y - X B)^\top (Y - X B) \right] \right).

The joint posterior density is then p(B, \Sigma \mid Y, X) \propto p(B \mid \Sigma)\, p(\Sigma)\, p(Y \mid B, \Sigma, X). To derive the conditional posterior p(B \mid \Sigma, Y, X), note that it is proportional to p(B \mid \Sigma)\, p(Y \mid B, \Sigma, X), both of which are exponential in quadratic forms of B. Combining the exponents yields

-\frac{1}{2} \operatorname{tr}\left[ \Sigma^{-1} \left( (B - M_0)^\top V_0^{-1} (B - M_0) + (Y - X B)^\top (Y - X B) \right) \right].

Expanding and collecting the terms in B gives a quadratic form B^\top (V_0^{-1} + X^\top X) B - 2 B^\top (V_0^{-1} M_0 + X^\top Y) plus constant terms, where the posterior precision matrix is V_n^{-1} = V_0^{-1} + X^\top X and the posterior mean matrix satisfies V_n^{-1} M_n = V_0^{-1} M_0 + X^\top Y, or equivalently,

M_n = V_n (V_0^{-1} M_0 + X^\top Y).

The determinant factors from the prior and likelihood contribute |\Sigma|^{-(n + q)/2}, matching the form of a matrix normal density. Thus, the conditional posterior is B \mid \Sigma, Y, X \sim \mathrm{MN}(M_n, V_n, \Sigma). The marginal posterior p(\Sigma \mid Y, X) is obtained by integrating out B: p(\Sigma \mid Y, X) \propto p(\Sigma) \int p(B \mid \Sigma)\, p(Y \mid B, \Sigma, X)\, dB. The integral is the normalizing constant of the conditional posterior for B, which, as a multivariate normal in \operatorname{vec}(B) with covariance \Sigma \otimes V_n, evaluates to (2\pi)^{qp/2} |V_n|^{p/2} |\Sigma|^{q/2} times the exponential of the minimum quadratic form. The minimum value of the quadratic form (Y - X B)^\top (Y - X B) + (B - M_0)^\top V_0^{-1} (B - M_0) at B = M_n is (Y - X M_n)^\top (Y - X M_n) + (M_n - M_0)^\top V_0^{-1} (M_n - M_0).
Combining with the likelihood's |\Sigma|^{-n/2} and the prior's form, the |\Sigma|^{q/2} from the integral cancels the |\Sigma|^{-q/2} implicit in the conditional prior, yielding

p(\Sigma \mid Y, X) \propto |\Sigma|^{-(\nu_0 + n + p + 1)/2} \exp\left( -\frac{1}{2} \operatorname{tr}\left[ \Sigma^{-1} \Psi_n \right] \right),

where \nu_n = \nu_0 + n and

\Psi_n = \Psi_0 + (Y - X M_n)^\top (Y - X M_n) + (M_n - M_0)^\top V_0^{-1} (M_n - M_0).

This confirms that the marginal posterior is inverse Wishart, \Sigma \mid Y, X \sim \mathrm{IW}(\nu_n, \Psi_n).
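The completion-of-squares step in this derivation can be verified numerically; the sketch below checks, for arbitrary test matrices, that the prior and likelihood quadratic forms equal the posterior quadratic form plus the residual terms that enter \Psi_n.

```python
import numpy as np

rng = np.random.default_rng(6)
n, q, p = 25, 3, 2
X = rng.normal(size=(n, q))
Y = rng.normal(size=(n, p))
B = rng.normal(size=(q, p))                       # an arbitrary value of the coefficient matrix
M0 = rng.normal(size=(q, p))
A = rng.normal(size=(q, q))
V0 = A @ A.T + q * np.eye(q)                      # positive definite prior row covariance

V0_inv = np.linalg.inv(V0)
Vn_inv = V0_inv + X.T @ X
Mn = np.linalg.solve(Vn_inv, V0_inv @ M0 + X.T @ Y)

lhs = (Y - X @ B).T @ (Y - X @ B) + (B - M0).T @ V0_inv @ (B - M0)
rhs = ((B - Mn).T @ Vn_inv @ (B - Mn)
       + (Y - X @ Mn).T @ (Y - X @ Mn)
       + (Mn - M0).T @ V0_inv @ (Mn - M0))
assert np.allclose(lhs, rhs)
```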

General Posterior Form and Properties

In Bayesian multivariate linear regression, the general posterior distribution for the regression coefficients B and the error covariance matrix \Sigma is given by

p(B, \Sigma \mid Y, X) \propto p(Y \mid B, \Sigma, X)\, p(B \mid \Sigma)\, p(\Sigma),

where p(Y \mid B, \Sigma, X) denotes the matrix normal likelihood assuming Y = X B + E with E \sim \mathcal{MN}(0, I_n, \Sigma), and the priors p(B \mid \Sigma) and p(\Sigma) are chosen flexibly. This joint posterior typically lacks a closed form in non-conjugate settings, rendering it non-standard and requiring sampling-based methods to explore its shape and sample from it. A prominent property is that the posterior mean of B, E[B \mid Y, X], acts as a Bayesian point estimator for the coefficients, which shrinks the ordinary least squares estimates toward the prior mean and provides regularization by balancing data evidence with prior beliefs. Credible intervals for elements of B and \Sigma are derived directly from the posterior quantiles, quantifying uncertainty in a way that incorporates both sampling variability and prior specification, unlike frequentist intervals. The posterior predictive distribution for new data Y_{\text{new}} given observed Y and X is obtained via marginalization,

p(Y_{\text{new}} \mid Y, X) = \int p(Y_{\text{new}} \mid B, \Sigma, X)\, p(B, \Sigma \mid Y, X)\, dB\, d\Sigma,

which propagates parameter uncertainty into forecasts and often yields wider predictive intervals than plug-in classical predictions. Marginal posteriors for subsets of parameters, such as individual columns of B corresponding to specific responses, are obtained by integrating out the remaining parameters; if \Sigma is diagonal (implying independence across responses), these reduce to univariate Bayesian linear regression posteriors, but in the general case, the off-diagonal covariances induce dependence, coupling the marginals across responses.
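In the conjugate case this marginalization is easy to approximate by simple Monte Carlo: draw (B, \Sigma) from the posterior and then simulate new observations. The sketch below assumes the conjugate posterior hyperparameters from the derivation above (M_n, V_n, \nu_n, \Psi_n) have already been computed; the function name and defaults are ours.

```python
import numpy as np
from scipy.stats import invwishart, matrix_normal

def posterior_predictive(X_new, Mn, Vn, nu_n, Psi_n, n_draws=1000, seed=7):
    """Monte Carlo draws from p(Y_new | Y, X) under the conjugate posterior."""
    rng = np.random.default_rng(seed)
    m, p = X_new.shape[0], Psi_n.shape[0]
    draws = np.empty((n_draws, m, p))
    for s in range(n_draws):
        Sigma = invwishart.rvs(df=nu_n, scale=Psi_n, random_state=rng)   # Sigma | Y
        B = matrix_normal.rvs(mean=Mn, rowcov=Vn, colcov=Sigma,          # B | Sigma, Y
                              random_state=rng)
        E = rng.multivariate_normal(np.zeros(p), Sigma, size=m)          # new noise
        draws[s] = X_new @ B + E
    return draws
```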

Inference Methods

Analytical Solutions

In the conjugate prior framework for Bayesian multivariate linear regression, closed-form analytical solutions for posterior inference are available, facilitating exact computation of moments and credible regions without approximation. The model posits \mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}, where \mathbf{Y} is an n \times p response matrix, \mathbf{X} is an n \times q design matrix, \mathbf{B} is the q \times p coefficient matrix, and the rows of the error matrix \mathbf{E} are i.i.d. \mathcal{N}_p(\mathbf{0}, \boldsymbol{\Sigma}). The conjugate prior specifies \mathbf{B} \mid \boldsymbol{\Sigma} \sim \text{MN}_{q,p}(\mathbf{B}_0, \mathbf{V}_0, \boldsymbol{\Sigma}) and \boldsymbol{\Sigma} \sim \text{IW}_p(\nu_0, \boldsymbol{\Psi}_0), leading to a posterior of the same form: \mathbf{B} \mid \boldsymbol{\Sigma}, \mathbf{Y} \sim \text{MN}_{q,p}(\mathbf{M}_n, \mathbf{V}_n, \boldsymbol{\Sigma}) and \boldsymbol{\Sigma} \mid \mathbf{Y} \sim \text{IW}_p(\nu_n, \boldsymbol{\Psi}_n), with updated parameters

\mathbf{V}_n^{-1} = \mathbf{V}_0^{-1} + \mathbf{X}^\top \mathbf{X}, \qquad \mathbf{M}_n = \mathbf{V}_n (\mathbf{V}_0^{-1} \mathbf{B}_0 + \mathbf{X}^\top \mathbf{Y}), \qquad \nu_n = \nu_0 + n,

\boldsymbol{\Psi}_n = \boldsymbol{\Psi}_0 + (\mathbf{Y} - \mathbf{X}\mathbf{M}_n)^\top (\mathbf{Y} - \mathbf{X}\mathbf{M}_n) + (\mathbf{M}_n - \mathbf{B}_0)^\top \mathbf{V}_0^{-1} (\mathbf{M}_n - \mathbf{B}_0).

The marginal posterior for \mathbf{B} \mid \mathbf{Y} follows a matrix-variate t distribution, obtained by integrating out \boldsymbol{\Sigma}, with location \mathbf{M}_n, row scale matrix \mathbf{V}_n, column scale matrix \boldsymbol{\Psi}_n / (\nu_n - p + 1), and degrees of freedom \nu_n - p + 1. The exact posterior mean is given by

E[\mathbf{B} \mid \mathbf{Y}] = \mathbf{M}_n,

which represents a shrinkage estimator blending the prior mean \mathbf{B}_0 and the ordinary least squares estimator (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}. The posterior variance is

\text{Var}(\text{vec}(\mathbf{B}) \mid \mathbf{Y}) = E[\boldsymbol{\Sigma} \mid \mathbf{Y}] \otimes \mathbf{V}_n,

where the posterior mean of the covariance matrix is

E[\boldsymbol{\Sigma} \mid \mathbf{Y}] = \frac{\boldsymbol{\Psi}_n}{\nu_n - p - 1},

provided \nu_n > p + 1 to ensure finite moments; this expectation shrinks the prior scale toward the residual sum of squares adjusted by the data. Credible regions for elements or linear combinations of \mathbf{B} can be constructed as ellipsoidal sets derived from the matrix-t posterior, leveraging its quadratic form properties to define joint probability contours at desired levels (e.g., 95%). For instance, a credible region for \mathbf{B} can be based on the quadratic form (\mathbf{B} - \mathbf{M}_n)^\top \mathbf{V}_n^{-1} (\mathbf{B} - \mathbf{M}_n), whose distribution under the matrix-t posterior is related to an F distribution through the column degrees of freedom, enabling exact multivariate intervals. In low-dimensional settings where q and p are small (e.g., q, p \leq 5), these regions permit exact numerical integration over the posterior density for precise coverage probabilities and visualization. These analytical solutions are restricted to the conjugate prior setup, as derived in the Posterior Distribution section, and do not extend straightforwardly to non-conjugate or flexible priors.
Computationally, they become impractical in high dimensions due to the cubic-time matrix inversions required for \mathbf{V}_n (order O(q^3)) and evaluations of the inverse Wishart density (order O(p^3)), limiting applicability to moderate-sized problems with q, p < 20.
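A minimal NumPy sketch of these closed-form quantities, assuming the prior hyperparameters (B_0, V_0, \nu_0, \Psi_0) are given (the function name is ours):

```python
import numpy as np

def conjugate_posterior_moments(Y, X, B0, V0, nu0, Psi0):
    """Posterior mean of B, posterior mean of Sigma, and Var(vec(B) | Y)
    under the matrix-normal / inverse-Wishart conjugate prior (column-stacking vec)."""
    n, p = Y.shape
    V0_inv = np.linalg.inv(V0)
    Vn = np.linalg.inv(V0_inv + X.T @ X)
    Mn = Vn @ (V0_inv @ B0 + X.T @ Y)                       # E[B | Y]
    Psi_n = (Psi0 + (Y - X @ Mn).T @ (Y - X @ Mn)
             + (Mn - B0).T @ V0_inv @ (Mn - B0))
    nu_n = nu0 + n
    E_Sigma = Psi_n / (nu_n - p - 1)                        # E[Sigma | Y], requires nu_n > p + 1
    var_vec_B = np.kron(E_Sigma, Vn)                        # E[Sigma | Y] (x) V_n
    return Mn, E_Sigma, var_vec_B
```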

Computational Approaches

In Bayesian multivariate linear regression, when analytical posterior inference is intractable due to non-conjugate priors or high dimensionality, computational methods such as Markov chain Monte Carlo (MCMC) and variational inference provide approximations through sampling or optimization. These approaches enable posterior summaries like credible intervals and predictive distributions by iteratively exploring the parameter space. MCMC methods, particularly Gibbs sampling, are widely used for posterior sampling in semi-conjugate settings, such as when a flat prior is placed on B. The Gibbs sampler alternates draws from the full conditional posteriors: assuming a flat prior on B, the regression coefficients \mathbf{B} given the covariance matrix \Sigma, response matrix \mathbf{Y}, and predictors \mathbf{X} follow a matrix normal distribution, p(\mathbf{B} \mid \Sigma, \mathbf{Y}, \mathbf{X}) \sim \mathcal{MN}_{q \times p}(\mathbf{M}, (\mathbf{X}^\top \mathbf{X})^{-1}, \Sigma), while \Sigma given \mathbf{B}, \mathbf{Y}, and \mathbf{X} follows an inverse-Wishart distribution, p(\Sigma \mid \mathbf{B}, \mathbf{Y}, \mathbf{X}) \sim \mathcal{IW}(\mathbf{S}, \nu). For the general conjugate prior B | Σ ~ MN(B_0, V_0, Σ), the conditional posterior is MN(M_n, V_n, Σ) with V_n^{-1} = V_0^{-1} + X^T X and M_n = V_n (V_0^{-1} B_0 + X^T Y). This block-conditional structure ensures efficient mixing, even when the joint prior is not fully conjugate. For non-standard priors that disrupt conjugacy, such as hierarchical or shrinkage-inducing distributions, Metropolis-Hastings steps within the MCMC chain target the full conditionals by proposing updates from tailored proposal distributions like random walks on the parameter space. Variational inference offers a faster, deterministic alternative by approximating the true posterior with a simpler factorized distribution, often a mean-field product of matrix normal and inverse Wishart components, q(\mathbf{B}, \Sigma) = q(\mathbf{B}) q(\Sigma). The approximation minimizes the Kullback-Leibler divergence to the true posterior, equivalently maximizing the evidence lower bound (ELBO) through coordinate ascent updates on the variational parameters. This method scales well to large datasets, with applications in high-dimensional predictive regressions where it achieves near-MCMC accuracy while reducing computation time by orders of magnitude. Implementations of these methods are available in several software packages, enhancing accessibility for practitioners. The Stan probabilistic programming language, interfaced via the R package brms, supports flexible specification of multivariate linear models with MCMC (via Hamiltonian Monte Carlo) or variational inference, scaling to datasets with hundreds of observations and responses. JAGS provides Gibbs sampling for custom Bayesian models, including multivariate regressions, via an R interface such as rjags, suitable for moderate-sized problems. For high-dimensional settings, the BayesSUR R package implements variational and MCMC inference for seemingly unrelated regression models, a generalization of multivariate linear models, handling up to thousands of predictors efficiently.
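A compact Gibbs sampler for the flat-prior-on-B case described above can be written with SciPy's matrix_normal and invwishart; this is an illustrative sketch (the function name and defaults are ours), not a reference implementation from any of the packages mentioned.

```python
import numpy as np
from scipy.stats import invwishart, matrix_normal

def gibbs_mvreg(Y, X, nu0, Psi0, n_iter=2000, seed=0):
    """Gibbs sampler for Y = X B + E, rows of E ~ N(0, Sigma),
    with a flat prior on B and Sigma ~ InverseWishart(nu0, Psi0)."""
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    B_hat = XtX_inv @ X.T @ Y                         # OLS estimate = conditional posterior mean of B
    Sigma = np.cov(Y - X @ B_hat, rowvar=False) + 1e-6 * np.eye(p)   # crude initial value
    draws_B, draws_Sigma = [], []
    for _ in range(n_iter):
        # B | Sigma, Y ~ MatrixNormal(B_hat, (X'X)^{-1}, Sigma)
        B = matrix_normal.rvs(mean=B_hat, rowcov=XtX_inv, colcov=Sigma, random_state=rng)
        # Sigma | B, Y ~ InverseWishart(nu0 + n, Psi0 + (Y - XB)'(Y - XB))
        R = Y - X @ B
        Sigma = invwishart.rvs(df=nu0 + n, scale=Psi0 + R.T @ R, random_state=rng)
        draws_B.append(B)
        draws_Sigma.append(Sigma)
    return np.array(draws_B), np.array(draws_Sigma)
```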
