Multivariate random variable

from Wikipedia
In probability and statistics, a multivariate random variable or random vector is a list or vector of mathematical variables each of whose value is unknown, either because the value has not yet occurred or because there is imperfect knowledge of its value. The individual variables in a random vector are grouped together because they are all part of a single mathematical system — often they represent different properties of an individual statistical unit. For example, while a given person has a specific age, height and weight, the representation of these features of an unspecified person from within a group would be a random vector. Normally each element of a random vector is a real number.

Random vectors are often used as the underlying implementation of various types of aggregate random variables, e.g. a random matrix, random tree, random sequence, stochastic process, etc.

Formally, a multivariate random variable is a column vector \mathbf{X} = (X_1, \dots, X_n)^T (or its transpose, which is a row vector) whose components are random variables on the probability space (\Omega, \mathcal{F}, P), where \Omega is the sample space, \mathcal{F} is the sigma-algebra (the collection of all events), and P is the probability measure (a function returning each event's probability).

Probability distribution


Every random vector gives rise to a probability measure on \mathbb{R}^n with the Borel algebra as the underlying sigma-algebra. This measure is also known as the joint probability distribution, the joint distribution, or the multivariate distribution of the random vector.

The distributions of each of the component random variables X_i are called marginal distributions. The conditional probability distribution of X_i given X_j is the probability distribution of X_i when X_j is known to be a particular value.

The cumulative distribution function F_{\mathbf{X}} : \mathbb{R}^n \to [0,1] of a random vector \mathbf{X} = (X_1, \dots, X_n)^T is defined as[1]: p.15

F_{\mathbf{X}}(\mathbf{x}) = P(X_1 \leq x_1, \dots, X_n \leq x_n),

where \mathbf{x} = (x_1, \dots, x_n)^T.

Operations on random vectors


Random vectors can be subjected to the same kinds of algebraic operations as can non-random vectors: addition, subtraction, multiplication by a scalar, and the taking of inner products.

Affine transformations


Similarly, a new random vector \mathbf{Y} can be defined by applying an affine transformation to a random vector \mathbf{X}:

\mathbf{Y} = A\mathbf{X} + b, where A is an n \times n matrix and b is an n \times 1 column vector.

If A is an invertible matrix and \mathbf{X} has a probability density function f_{\mathbf{X}}, then the probability density of \mathbf{Y} is

f_{\mathbf{Y}}(\mathbf{y}) = \frac{f_{\mathbf{X}}(A^{-1}(\mathbf{y} - b))}{|\det A|}.
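As a numeric sanity check on the affine rule (a minimal NumPy sketch; the matrix A and shift b below are illustrative), one can transform standard normal samples, for which the covariance is the identity, and compare the sample moments of \mathbf{Y} = A\mathbf{X} + b with the theoretical values E[\mathbf{Y}] = b and \operatorname{Cov}[\mathbf{Y}] = AA^T:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0], [0.0, 3.0]])   # invertible 2x2 matrix (illustrative)
b = np.array([1.0, -2.0])                # shift vector (illustrative)

X = rng.standard_normal((100_000, 2))    # samples of X ~ N(0, I)
Y = X @ A.T + b                          # affine transform Y = A X + b, row-wise

# Theory: E[Y] = b, Cov(Y) = A Cov(X) A^T = A A^T
print(Y.mean(axis=0))                    # close to [1, -2]
print(np.cov(Y, rowvar=False))           # close to A @ A.T
```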

Invertible mappings


More generally we can study invertible mappings of random vectors.[2]: p.284–285

Let g be a one-to-one mapping from an open subset \mathcal{D} of \mathbb{R}^n onto a subset \mathcal{R} of \mathbb{R}^n, let g have continuous partial derivatives in \mathcal{D} and let the Jacobian determinant of g be zero at no point of \mathcal{D}. Assume that the real random vector \mathbf{X} has a probability density function f_{\mathbf{X}}(\mathbf{x}) and satisfies P(\mathbf{X} \in \mathcal{D}) = 1. Then the random vector \mathbf{Y} = g(\mathbf{X}) is of probability density

f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{X}}(g^{-1}(\mathbf{y})) \left| \det \frac{\partial g^{-1}(\mathbf{y})}{\partial \mathbf{y}} \right| \mathbf{1}(\mathbf{y} \in R_{\mathbf{Y}}),

where \mathbf{1} denotes the indicator function and the set R_{\mathbf{Y}} = \{\mathbf{y} = g(\mathbf{x}) : f_{\mathbf{X}}(\mathbf{x}) > 0\} \subseteq \mathcal{R} denotes the support of \mathbf{Y}.

Expected value


The expected value or mean of a random vector \mathbf{X} is a fixed vector \operatorname{E}[\mathbf{X}] whose elements are the expected values of the respective random variables:[3]: p.333

\operatorname{E}[\mathbf{X}] = (\operatorname{E}[X_1], \dots, \operatorname{E}[X_n])^T.

Covariance and cross-covariance


Definitions


The covariance matrix (also called second central moment or variance-covariance matrix) of an n \times 1 random vector is an n \times n matrix whose (i,j)th element is the covariance between the i th and the j th random variables. The covariance matrix is the expected value, element by element, of the n \times n matrix computed as (\mathbf{X} - \operatorname{E}[\mathbf{X}])(\mathbf{X} - \operatorname{E}[\mathbf{X}])^T, where the superscript T refers to the transpose of the indicated vector:[2]: p.464 [3]: p.335

\operatorname{K}_{\mathbf{X}\mathbf{X}} = \operatorname{Var}[\mathbf{X}] = \operatorname{E}[(\mathbf{X} - \operatorname{E}[\mathbf{X}])(\mathbf{X} - \operatorname{E}[\mathbf{X}])^T].

By extension, the cross-covariance matrix between two random vectors \mathbf{X} and \mathbf{Y} (\mathbf{X} having n elements and \mathbf{Y} having p elements) is the n \times p matrix[3]: p.336

\operatorname{K}_{\mathbf{X}\mathbf{Y}} = \operatorname{Cov}[\mathbf{X}, \mathbf{Y}] = \operatorname{E}[(\mathbf{X} - \operatorname{E}[\mathbf{X}])(\mathbf{Y} - \operatorname{E}[\mathbf{Y}])^T],

where again the matrix expectation is taken element-by-element in the matrix. Here the (i,j)th element is the covariance between the i th element of \mathbf{X} and the j th element of \mathbf{Y}.

Properties


The covariance matrix is a symmetric matrix, i.e.[2]: p.466

\operatorname{K}_{\mathbf{X}\mathbf{X}}^T = \operatorname{K}_{\mathbf{X}\mathbf{X}}.

The covariance matrix is a positive semidefinite matrix, i.e.[2]: p.465

\mathbf{a}^T \operatorname{K}_{\mathbf{X}\mathbf{X}} \mathbf{a} \geq 0 \quad \text{for all } \mathbf{a} \in \mathbb{R}^n.

The cross-covariance matrix \operatorname{Cov}[\mathbf{Y}, \mathbf{X}] is simply the transpose of the matrix \operatorname{Cov}[\mathbf{X}, \mathbf{Y}], i.e.

\operatorname{K}_{\mathbf{Y}\mathbf{X}} = \operatorname{K}_{\mathbf{X}\mathbf{Y}}^T.
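Both properties can be checked empirically on a sample covariance matrix; a minimal NumPy sketch (the data-generating matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated data: mix independent columns through an arbitrary 3x3 matrix
X = rng.standard_normal((5000, 3)) @ rng.standard_normal((3, 3))

K = np.cov(X, rowvar=False)              # empirical covariance matrix

print(np.allclose(K, K.T))               # symmetry: K^T = K -> True
print(np.linalg.eigvalsh(K))             # all eigenvalues >= 0 (up to rounding)

a = np.array([1.0, -2.0, 0.5])           # any fixed vector
print(a @ K @ a)                         # a^T K a = Var(a^T X) >= 0
```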

Uncorrelatedness


Two random vectors \mathbf{X} and \mathbf{Y} are called uncorrelated if

\operatorname{E}[\mathbf{X}\mathbf{Y}^T] = \operatorname{E}[\mathbf{X}]\operatorname{E}[\mathbf{Y}]^T.

They are uncorrelated if and only if their cross-covariance matrix \operatorname{K}_{\mathbf{X}\mathbf{Y}} is zero.[3]: p.337

Correlation and cross-correlation


Definitions


The correlation matrix (also called second moment) of an n \times 1 random vector is an n \times n matrix whose (i,j)th element is the correlation between the i th and the j th random variables. The correlation matrix is the expected value, element by element, of the n \times n matrix computed as \mathbf{X}\mathbf{X}^T, where the superscript T refers to the transpose of the indicated vector:[4]: p.190 [3]: p.334

\operatorname{R}_{\mathbf{X}\mathbf{X}} = \operatorname{E}[\mathbf{X}\mathbf{X}^T].

By extension, the cross-correlation matrix between two random vectors \mathbf{X} and \mathbf{Y} (\mathbf{X} having n elements and \mathbf{Y} having p elements) is the n \times p matrix

\operatorname{R}_{\mathbf{X}\mathbf{Y}} = \operatorname{E}[\mathbf{X}\mathbf{Y}^T].

Properties


The correlation matrix is related to the covariance matrix by

\operatorname{R}_{\mathbf{X}\mathbf{X}} = \operatorname{K}_{\mathbf{X}\mathbf{X}} + \operatorname{E}[\mathbf{X}]\operatorname{E}[\mathbf{X}]^T.

Similarly for the cross-correlation matrix and the cross-covariance matrix:

\operatorname{R}_{\mathbf{X}\mathbf{Y}} = \operatorname{K}_{\mathbf{X}\mathbf{Y}} + \operatorname{E}[\mathbf{X}]\operatorname{E}[\mathbf{Y}]^T.
Orthogonality


Two random vectors of the same size \mathbf{X} = (X_1, \dots, X_n)^T and \mathbf{Y} = (Y_1, \dots, Y_n)^T are called orthogonal if

\operatorname{E}[\mathbf{X}^T\mathbf{Y}] = 0.

Independence


Two random vectors \mathbf{X} and \mathbf{Y} are called independent if for all \mathbf{x} and \mathbf{y}

F_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) = F_{\mathbf{X}}(\mathbf{x}) \cdot F_{\mathbf{Y}}(\mathbf{y}),

where F_{\mathbf{X}} and F_{\mathbf{Y}} denote the cumulative distribution functions of \mathbf{X} and \mathbf{Y} and F_{\mathbf{X},\mathbf{Y}} denotes their joint cumulative distribution function. Independence of \mathbf{X} and \mathbf{Y} is often denoted by \mathbf{X} \perp\!\!\!\perp \mathbf{Y}. Written component-wise, \mathbf{X} and \mathbf{Y} are called independent if for all x_1, \dots, x_m, y_1, \dots, y_n

F_{X_1,\dots,X_m,Y_1,\dots,Y_n}(x_1,\dots,x_m,y_1,\dots,y_n) = F_{X_1,\dots,X_m}(x_1,\dots,x_m) \cdot F_{Y_1,\dots,Y_n}(y_1,\dots,y_n).

Characteristic function


The characteristic function of a random vector \mathbf{X} with n components is a function \mathbb{R}^n \to \mathbb{C} that maps every vector \mathbf{t} = (t_1, \dots, t_n)^T to a complex number. It is defined by[2]: p.468

\varphi_{\mathbf{X}}(\mathbf{t}) = \operatorname{E}\left[e^{i(\mathbf{t}^T\mathbf{X})}\right] = \operatorname{E}\left[e^{i(t_1 X_1 + \dots + t_n X_n)}\right].
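For a standard normal vector the characteristic function is known in closed form, \varphi(\mathbf{t}) = e^{-\|\mathbf{t}\|^2/2}, which allows a Monte Carlo check of the definition (a sketch; sample size, seed, and the test point \mathbf{t} are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
X = rng.standard_normal((n, 2))          # samples of X ~ N(0, I_2)
t = np.array([0.5, -1.0])                # evaluation point (illustrative)

phi_mc = np.exp(1j * (X @ t)).mean()     # Monte Carlo estimate of E[exp(i t^T X)]
phi_exact = np.exp(-0.5 * t @ t)         # closed form for the standard normal

print(abs(phi_mc - phi_exact))           # small estimation error
```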

Further properties


Expectation of a quadratic form


One can take the expectation of a quadratic form in the random vector \mathbf{X} as follows:[5]: p.170–171

\operatorname{E}[\mathbf{X}^T A \mathbf{X}] = \operatorname{E}[\mathbf{X}]^T A \operatorname{E}[\mathbf{X}] + \operatorname{tr}(A \operatorname{K}_{\mathbf{X}\mathbf{X}}),

where \operatorname{K}_{\mathbf{X}\mathbf{X}} is the covariance matrix of \mathbf{X} and \operatorname{tr} refers to the trace of a matrix, that is, to the sum of the elements on its main diagonal (from upper left to lower right). Since the quadratic form is a scalar, so is its expectation.

Proof: Let \mathbf{z} be an n \times 1 random vector with \operatorname{E}[\mathbf{z}] = \mu and \operatorname{Cov}[\mathbf{z}] = V and let A be an n \times n non-stochastic matrix.

Then based on the formula for the covariance, if we denote \mathbf{z}^T = \mathbf{X} and \mathbf{z}^T A^T = \mathbf{Y}, we see that:

\operatorname{Cov}[\mathbf{X}, \mathbf{Y}] = \operatorname{E}[\mathbf{X}\mathbf{Y}^T] - \operatorname{E}[\mathbf{X}]\operatorname{E}[\mathbf{Y}]^T.

Hence

\operatorname{E}[\mathbf{z}^T A \mathbf{z}] = \operatorname{Cov}[\mathbf{z}^T, \mathbf{z}^T A^T] + \operatorname{E}[\mathbf{z}^T]\operatorname{E}[\mathbf{z}^T A^T]^T = \operatorname{Cov}[\mathbf{z}^T, \mathbf{z}^T A^T] + \mu^T A \mu,

which leaves us to show that

\operatorname{Cov}[\mathbf{z}^T, \mathbf{z}^T A^T] = \operatorname{tr}(AV).

This is true based on the fact that one can cyclically permute matrices when taking a trace without changing the end result (e.g.: \operatorname{tr}(AB) = \operatorname{tr}(BA)).

We see that

\operatorname{Cov}[\mathbf{z}^T, \mathbf{z}^T A^T] = \operatorname{E}[(\mathbf{z} - \mu)^T (A\mathbf{z} - A\mu)] = \operatorname{E}[(\mathbf{z} - \mu)^T A (\mathbf{z} - \mu)].

And since

(\mathbf{z} - \mu)^T A (\mathbf{z} - \mu)

is a scalar, then

(\mathbf{z} - \mu)^T A (\mathbf{z} - \mu) = \operatorname{tr}\left((\mathbf{z} - \mu)^T A (\mathbf{z} - \mu)\right)

trivially. Using the permutation we get:

\operatorname{tr}\left((\mathbf{z} - \mu)^T A (\mathbf{z} - \mu)\right) = \operatorname{tr}\left(A (\mathbf{z} - \mu)(\mathbf{z} - \mu)^T\right),

and by plugging this into the original formula we get:

\operatorname{E}[\mathbf{z}^T A \mathbf{z}] = \mu^T A \mu + \operatorname{tr}\left(A \operatorname{E}[(\mathbf{z} - \mu)(\mathbf{z} - \mu)^T]\right) = \mu^T A \mu + \operatorname{tr}(AV).
Expectation of the product of two different quadratic forms


One can take the expectation of the product of two different quadratic forms in a zero-mean Gaussian random vector \mathbf{X} as follows:[5]: pp.162–176

\operatorname{E}[(\mathbf{X}^T A \mathbf{X})(\mathbf{X}^T B \mathbf{X})] = 2\operatorname{tr}(A \operatorname{K}_{\mathbf{X}\mathbf{X}} B \operatorname{K}_{\mathbf{X}\mathbf{X}}) + \operatorname{tr}(A \operatorname{K}_{\mathbf{X}\mathbf{X}})\operatorname{tr}(B \operatorname{K}_{\mathbf{X}\mathbf{X}}),

where again \operatorname{K}_{\mathbf{X}\mathbf{X}} is the covariance matrix of \mathbf{X}. Again, since both quadratic forms are scalars and hence their product is a scalar, the expectation of their product is also a scalar.

Applications


Portfolio theory


In portfolio theory in finance, an objective often is to choose a portfolio of risky assets such that the distribution of the random portfolio return has desirable properties. For example, one might want to choose the portfolio return having the lowest variance for a given expected value. Here the random vector \mathbf{r} is the vector of random returns on the individual assets, and the portfolio return p (a random scalar) is the inner product of the vector of random returns with a vector \mathbf{w} of portfolio weights, the fractions of the portfolio placed in the respective assets. Since p = \mathbf{w}^T\mathbf{r}, the expected value of the portfolio return is \mathbf{w}^T\operatorname{E}[\mathbf{r}] and the variance of the portfolio return can be shown to be \mathbf{w}^T C \mathbf{w}, where C is the covariance matrix of \mathbf{r}.
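A minimal numeric illustration of the portfolio moments \mathbf{w}^T\operatorname{E}[\mathbf{r}] and \mathbf{w}^T C \mathbf{w} (all numbers below are made up for the example):

```python
import numpy as np

w = np.array([0.6, 0.4])                 # portfolio weights, summing to 1
mu_r = np.array([0.08, 0.12])            # expected asset returns (illustrative)
C = np.array([[0.04, 0.01],
              [0.01, 0.09]])             # covariance matrix of returns (illustrative)

expected_return = w @ mu_r               # E[p] = w^T E[r]
variance = w @ C @ w                     # Var(p) = w^T C w

print(expected_return, variance)
```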

Regression theory


In linear regression theory, we have data on n observations on a dependent variable y and n observations on each of k independent variables x_j. The observations on the dependent variable are stacked into an n \times 1 column vector \mathbf{y}; the observations on each independent variable are also stacked into n \times 1 column vectors, and these latter column vectors are combined into an n \times k design matrix X (not denoting a random vector in this context) of observations on the independent variables. Then the following regression equation is postulated as a description of the process that generated the data:

\mathbf{y} = X\beta + \mathbf{e},

where \beta is a postulated fixed but unknown k \times 1 vector of response coefficients, and \mathbf{e} is an unknown n \times 1 random vector reflecting random influences on the dependent variable. By some chosen technique such as ordinary least squares, a vector \hat{\beta} is chosen as an estimate of \beta, and the estimate of the vector \mathbf{e}, denoted \hat{\mathbf{e}}, is computed as

\hat{\mathbf{e}} = \mathbf{y} - X\hat{\beta}.

Then the statistician must analyze the properties of \hat{\beta} and \hat{\mathbf{e}}, which are viewed as random vectors since a randomly different selection of n cases to observe would have resulted in different values for them.
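A sketch of the estimation step with ordinary least squares, assuming NumPy (the data are synthetic and the true \beta is chosen for the example):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 2
# Design matrix: an intercept column plus one random regressor
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta = np.array([1.0, 2.0])              # true coefficients (illustrative)
e = 0.1 * rng.standard_normal(n)         # random influences on y
y = X @ beta + e

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS: solve the normal equations
e_hat = y - X @ beta_hat                       # residual vector, estimate of e

print(beta_hat)                          # near [1, 2]
print(X.T @ e_hat)                       # ~ 0: residuals orthogonal to columns of X
```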

Vector time series


The evolution of a k \times 1 random vector \mathbf{X} through time can be modelled as a vector autoregression (VAR) as follows:

\mathbf{X}_t = c + A_1 \mathbf{X}_{t-1} + A_2 \mathbf{X}_{t-2} + \dots + A_p \mathbf{X}_{t-p} + \mathbf{e}_t,

where the i-periods-back vector observation \mathbf{X}_{t-i} is called the i-th lag of \mathbf{X}, c is a k \times 1 vector of constants (intercepts), A_i is a time-invariant k \times k matrix and \mathbf{e}_t is a k \times 1 random vector of error terms.
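A VAR(1) special case can be simulated directly from the recursion; for a stable coefficient matrix the long-run mean is (I - A_1)^{-1}c, which the simulated path should approach (a sketch with illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(4)
k, T = 2, 200
c = np.array([0.1, -0.2])                # intercept vector (illustrative)
A1 = np.array([[0.5, 0.1],
               [0.0, 0.4]])              # stable: eigenvalues inside unit circle

X = np.zeros((T, k))
for t in range(1, T):
    e_t = 0.05 * rng.standard_normal(k)  # error-term random vector
    X[t] = c + A1 @ X[t - 1] + e_t       # VAR(1) recursion

# Long-run mean of a stable VAR(1): (I - A1)^{-1} c
print(np.linalg.solve(np.eye(k) - A1, c))
print(X[100:].mean(axis=0))              # sample mean after burn-in, comparable
```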

from Grokipedia
A multivariate random variable, also referred to as a random vector, is a measurable function from a probability space to \mathbb{R}^n, mapping outcomes to an n-dimensional vector (X_1, X_2, \dots, X_n)^\top where each X_i is a scalar random variable, enabling the analysis of joint behaviors among multiple interdependent quantities.[1] The joint distribution of a multivariate random variable captures the probabilities of simultaneous outcomes for its components, specified by the joint cumulative distribution function F_{X_1, \dots, X_n}(x_1, \dots, x_n) = P(X_1 \leq x_1, \dots, X_n \leq x_n) for continuous cases or the joint probability mass function for discrete cases.[1] Marginal distributions for individual components are derived by summing or integrating the joint distribution over the other variables, such as f_{X_i}(x_i) = \int f_X(x) \, dx_1 \cdots dx_{i-1} \, dx_{i+1} \cdots dx_n for continuous densities.[1] Dependence between components is quantified through measures like covariance, defined as \operatorname{Cov}(X_j, X_k) = E[(X_j - \mu_j)(X_k - \mu_k)], and the correlation coefficient \rho_{X_j, X_k} = \operatorname{Cov}(X_j, X_k) / \sqrt{\operatorname{Var}(X_j) \operatorname{Var}(X_k)}, which ranges from -1 to 1 and indicates linear association strength.[2] Independence holds if the joint distribution factors into the product of marginals, implying zero correlation, though the converse is not always true.[1] The multivariate normal distribution stands as the cornerstone example of a multivariate random variable, characterized by a mean vector \mu and covariance matrix \Sigma, with density f(x) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\} for x \in \mathbb{R}^d.[3] It possesses key properties such as normality of all linear combinations and marginals, and independence of components when uncorrelated.[2] Multivariate random variables find broad applications in fields like statistics, where they model multiple measurements such as age, blood pressure, and cholesterol levels in health studies, and in geophysics for analyzing correlated seismic data.[4][2]

Fundamentals

Definition

In probability theory, a multivariate random variable, also known as a random vector, is formally defined as a measurable function \mathbf{X}: \Omega \to \mathbb{R}^n from a probability space (\Omega, \mathcal{F}, P) to the n-dimensional Euclidean space \mathbb{R}^n, where n \geq 2 and \mathbf{X} = (X_1, \dots, X_n) with each X_i: \Omega \to \mathbb{R} being a univariate random variable.[5] This definition ensures that the joint behavior of the components can be analyzed through the induced probability measure on \mathbb{R}^n.[5] Unlike a univariate random variable, which captures the probabilistic behavior of a single scalar outcome, a multivariate random variable emphasizes the interdependence and joint distribution among multiple components, allowing for the study of relationships such as correlation or dependence across dimensions.[5] While the standard formulation targets real-valued spaces \mathbb{R}^n, the concept extends to more general settings, including complex vector spaces \mathbb{C}^n in applications like signal processing, where the random vector is a measurable mapping to \mathbb{C}^n.[6]

Notation and Examples

Standard notation for a multivariate random variable employs boldface letters to denote vectors, with \mathbf{X} = (X_1, X_2, \dots, X_p)^\top representing a p-dimensional random vector, where each X_i is a univariate random variable and the superscript \top indicates transposition to a column vector. A specific realization or observed value of this vector is denoted by lowercase \mathbf{x} = (x_1, x_2, \dots, x_p)^\top, allowing distinction between the random entity and its outcomes. To illustrate, consider a bivariate case where \mathbf{X} = (X_1, X_2)^\top captures the height and weight of adult humans in a population; here, X_1 might represent height in inches and X_2 weight in pounds, with joint variability reflecting biological dependencies.[4] For a continuous example, a multivariate normal random vector \mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) has mean vector \boldsymbol{\mu} and positive semi-definite covariance matrix \boldsymbol{\Sigma}, commonly modeling correlated measurements like stock returns or sensor data in p dimensions. In the discrete setting, rolling two fair six-sided dice yields \mathbf{X} = (X_1, X_2)^\top, where X_1 and X_2 are the outcomes (each uniformly distributed over \{1, 2, \dots, 6\}), and the joint probability mass function is P(X_1 = i, X_2 = j) = 1/36 for i, j = 1, \dots, 6, demonstrating uniform joint support over 36 points.[7] Basic vector operations on multivariate random variables proceed component-wise: addition \mathbf{X} + \mathbf{Y} = (X_1 + Y_1, \dots, X_p + Y_p)^\top combines vectors, while scalar multiplication a\mathbf{X} = (aX_1, \dots, aX_p)^\top scales them.
The foundations of multivariate analysis trace to 19th-century statistics, notably Karl Pearson's 1896 exploration of correlations among multiple organ measurements in biological data, which laid groundwork for handling joint variability beyond univariate cases.[8]

Joint Distributions

Discrete Multivariate Random Variables

A discrete multivariate random variable is a vector \mathbf{X} = (X_1, X_2, \dots, X_n) where each component X_i takes values in a countable set, typically integers or subsets thereof.[9] The probability structure is fully specified by the joint probability mass function (PMF), which assigns probabilities to each possible outcome in the support.[10] The joint PMF is defined as p(\mathbf{x}) = P(\mathbf{X} = \mathbf{x}), where \mathbf{x} = (x_1, x_2, \dots, x_n) is a specific realization, and it satisfies p(\mathbf{x}) \geq 0 for all \mathbf{x} in the support.[11] The normalization condition requires that the probabilities sum to unity over the entire support: \sum_{\mathbf{x}} p(\mathbf{x}) = 1, where the sum is taken over all possible \mathbf{x}.[10] The support of \mathbf{X} is the finite or countably infinite set of points in \mathbb{Z}^n (or a subset) where p(\mathbf{x}) > 0.[9] Consider the example of two independent fair coin flips, where \mathbf{X} = (X_1, X_2) with X_1 indicating the outcome of the first flip (0 for tails, 1 for heads) and X_2 for the second. The support is \{(0,0), (0,1), (1,0), (1,1)\}, and the joint PMF is uniform due to independence and fairness: p(\mathbf{x}) = 0.25 for each point in the support.[12] This can be represented in a table:

X_1 \ X_2      0       1
0              0.25    0.25
1              0.25    0.25

To compute the probability of \mathbf{X} belonging to a subset A of the support, sum the joint PMF over the relevant points: P(\mathbf{X} \in A) = \sum_{\mathbf{x} \in A} p(\mathbf{x}).[11] For instance, in the coin flip example, P(X_1 = 1) = \sum_{x_2} p(1, x_2) = 0.25 + 0.25 = 0.5.[12]
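The coin-flip computation can be reproduced literally by enumerating the support (a minimal sketch using only the standard library):

```python
from itertools import product

# Joint PMF of two independent fair coin flips: p(x1, x2) = 0.25 on each point
pmf = {(x1, x2): 0.25 for x1, x2 in product([0, 1], repeat=2)}

# Normalization: probabilities sum to 1 over the support
total = sum(pmf.values())

# Marginal P(X1 = 1), summing the joint PMF over x2
p_x1_is_1 = sum(p for (x1, _), p in pmf.items() if x1 == 1)

print(total, p_x1_is_1)   # 1.0 0.5
```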

Continuous Multivariate Random Variables

A continuous multivariate random variable \mathbf{X} = (X_1, \dots, X_n) takes values in \mathbb{R}^n, where the probability structure is described by a joint probability density function (PDF) f(\mathbf{x}) that is nonnegative for all \mathbf{x} \in \mathbb{R}^n and integrates to 1 over the entire space, i.e., \int_{\mathbb{R}^n} f(\mathbf{x}) \, d\mathbf{x} = 1.[10][9] The nonnegativity ensures that probabilities are never negative, while the integrability condition normalizes the total probability.[13] The probability that \mathbf{X} falls into any measurable set A \subseteq \mathbb{R}^n is given by the integral of the joint PDF over that set: P(\mathbf{X} \in A) = \int_A f(\mathbf{x}) \, d\mathbf{x}. This integral represents the volume under the density surface over A, contrasting with discrete cases by using continuous integration rather than summation.[14] The joint cumulative distribution function (CDF) for \mathbf{X} is defined as F(\mathbf{x}) = P(\mathbf{X} \leq \mathbf{x}) = P(X_1 \leq x_1, \dots, X_n \leq x_n) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f(\mathbf{u}) \, du_n \cdots du_1, which provides the probability that all components are less than or equal to their respective values.[9][14] For continuous distributions, the CDF is continuous and differentiable, with the joint PDF obtained as its mixed partial derivative: f(\mathbf{x}) = \frac{\partial^n}{\partial x_1 \cdots \partial x_n} F(\mathbf{x}).[4] A prominent example is the bivariate normal distribution for n = 2, where the joint PDF is

f(x,y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1-\rho^2}} \exp\left( -\frac{1}{2(1-\rho^2)} \left[ \frac{(x-\mu_x)^2}{\sigma_x^2} + \frac{(y-\mu_y)^2}{\sigma_y^2} - \frac{2\rho (x-\mu_x)(y-\mu_y)}{\sigma_x \sigma_y} \right] \right),

with means \mu_x, \mu_y, standard deviations \sigma_x, \sigma_y > 0, and correlation coefficient \rho \in (-1,1).[15] This density captures linear dependence through \rho and reduces to the product of independent univariate normals when \rho = 0.[16] Marginal distributions can be obtained by integrating out one variable, yielding univariate normals.[17]
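The factorization at \rho = 0 can be checked numerically by coding the density above directly (a sketch using only the standard library):

```python
import math

def bivariate_normal_pdf(x, y, mu_x, mu_y, sd_x, sd_y, rho):
    """Joint density of the bivariate normal distribution (formula above)."""
    zx = (x - mu_x) / sd_x
    zy = (y - mu_y) / sd_y
    norm = 2 * math.pi * sd_x * sd_y * math.sqrt(1 - rho**2)
    q = (zx**2 + zy**2 - 2 * rho * zx * zy) / (1 - rho**2)
    return math.exp(-0.5 * q) / norm

# With rho = 0 the joint density factors into two univariate normal densities
f_joint = bivariate_normal_pdf(0.3, -0.4, 0, 0, 1, 1, 0.0)
f_prod = (math.exp(-0.5 * 0.3**2) / math.sqrt(2 * math.pi)) * \
         (math.exp(-0.5 * 0.4**2) / math.sqrt(2 * math.pi))
print(abs(f_joint - f_prod))   # ~ 0
```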

Moments and Central Tendency

Expected Value Vector

The expected value vector of a multivariate random variable \mathbf{X} = (X_1, \dots, X_n)^T, denoted \boldsymbol{\mu} = E[\mathbf{X}], is the vector whose components are the expected values of the individual random variables: \boldsymbol{\mu} = (E[X_1], \dots, E[X_n])^T.[18] This vector represents the first moment of the joint distribution and serves as a measure of central tendency for the multivariate distribution.[19] The i-th component \mu_i = E[X_i] is computed by integrating or summing the i-th coordinate weighted by the joint probability distribution of \mathbf{X}. For a discrete multivariate random variable with joint probability mass function p(\mathbf{x}),

E[X_i] = \sum_{\mathbf{x} \in \mathcal{X}} x_i \, p(\mathbf{x}),

where \mathcal{X} is the support of \mathbf{X}.[20] For a continuous multivariate random variable with joint probability density function f(\mathbf{x}),

E[X_i] = \int_{\mathbb{R}^n} x_i \, f(\mathbf{x}) \, d\mathbf{x}.

These expressions reduce to the marginal expectations when integrated or summed over the other variables, but the full joint distribution is required for exact computation in dependent cases.[14] The expected value operator exhibits linearity, a fundamental property that facilitates analysis and computation. Specifically, for scalar constants a and b, and random vectors \mathbf{X} and \mathbf{Y} of compatible dimensions,

E[a\mathbf{X} + b\mathbf{Y}] = aE[\mathbf{X}] + bE[\mathbf{Y}].

This holds component-wise and extends to affine transformations \mathbf{Y} = A\mathbf{X} + \mathbf{b}, yielding E[\mathbf{Y}] = AE[\mathbf{X}] + \mathbf{b}, where A is a matrix and \mathbf{b} a vector.[21][18] An unbiased estimator of \boldsymbol{\mu} is the sample mean vector obtained from an independent and identically distributed sample \mathbf{X}_1, \dots, \mathbf{X}_m drawn from the distribution of \mathbf{X}:

\bar{\mathbf{X}} = \frac{1}{m} \sum_{j=1}^m \mathbf{X}_j.

By linearity of expectation, E[\bar{\mathbf{X}}] = E[\mathbf{X}] = \boldsymbol{\mu}, confirming its unbiasedness regardless of the dependence structure among the components.[22][21]
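Both the componentwise expectation and its linearity can be verified by direct enumeration on a small discrete joint distribution (the support and probabilities below are illustrative):

```python
import numpy as np

# Discrete joint distribution on three points (dependent components)
points = np.array([[1.0, 2.0], [3.0, 0.0], [0.0, 1.0]])
probs = np.array([0.5, 0.3, 0.2])

mu = probs @ points                       # componentwise E[X_i]

# Linearity: E[aX + b] = a E[X] + b, checked by enumerating the support
a, b = 2.0, np.array([1.0, -1.0])
lhs = probs @ (a * points + b)            # E[aX + b] directly
rhs = a * mu + b                          # via linearity

print(mu, lhs, rhs)
```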

Covariance Matrix

The covariance matrix of a multivariate random variable \mathbf{X} = (X_1, \dots, X_p)^T with mean vector \boldsymbol{\mu} = E[\mathbf{X}] is the p \times p symmetric matrix \boldsymbol{\Sigma} = \operatorname{Cov}(\mathbf{X}) = E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T].[23] The element in the i-th row and j-th column of \boldsymbol{\Sigma} is the covariance \sigma_{ij} = \operatorname{Cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)], which quantifies the joint variability between components X_i and X_j.[23] When i = j, \sigma_{ii} reduces to the variance \operatorname{Var}(X_i), capturing the marginal spread of each component.[24] For computation, the elements of the covariance matrix are obtained via the definition of expectation. In the discrete case, where \mathbf{X} has joint probability mass function p(\mathbf{x}),

\sigma_{ij} = \sum_{\mathbf{x}} (x_i - \mu_i)(x_j - \mu_j) p(\mathbf{x}),

assuming the second moments exist.[25] In the continuous case, with joint probability density function f(\mathbf{x}),

\sigma_{ij} = \int (x_i - \mu_i)(x_j - \mu_j) f(\mathbf{x}) \, d\mathbf{x}.

These expressions extend the univariate covariance to measure linear dependence across multiple dimensions.[25] The covariance matrix possesses key properties that reflect its role in describing multivariate spread. It is positive semi-definite, meaning that for any non-zero p \times 1 vector \mathbf{a}, \mathbf{a}^T \boldsymbol{\Sigma} \mathbf{a} \geq 0, with equality if and only if \mathbf{a} is in the null space of \boldsymbol{\Sigma}; this follows from \mathbf{a}^T \boldsymbol{\Sigma} \mathbf{a} = \operatorname{Var}(\mathbf{a}^T \mathbf{X}) \geq 0.[23] Additionally, the trace of \boldsymbol{\Sigma}, \operatorname{tr}(\boldsymbol{\Sigma}) = \sum_{i=1}^p \sigma_{ii}, equals the sum of the component variances, providing a scalar summary of total variability.[24] In practice, the population covariance matrix is estimated from a sample of m independent observations \mathbf{x}_1, \dots, \mathbf{x}_m using the unbiased sample covariance matrix

\mathbf{S} = \frac{1}{m-1} \sum_{k=1}^m (\mathbf{x}_k - \bar{\mathbf{x}})(\mathbf{x}_k - \bar{\mathbf{x}})^T,

where \bar{\mathbf{x}} = \frac{1}{m} \sum_{k=1}^m \mathbf{x}_k is the sample mean vector.[26] This estimator is positive semi-definite and converges to \boldsymbol{\Sigma} as m \to \infty under standard conditions.[26] For two multivariate random variables \mathbf{X} and \mathbf{Y} with mean vectors \boldsymbol{\mu}_X and \boldsymbol{\mu}_Y, the cross-covariance matrix is the rectangular matrix \operatorname{Cov}(\mathbf{X}, \mathbf{Y}) = E[(\mathbf{X} - \boldsymbol{\mu}_X)(\mathbf{Y} - \boldsymbol{\mu}_Y)^T], whose elements are the pairwise covariances between components of \mathbf{X} and \mathbf{Y}.[23] When \mathbf{X} = \mathbf{Y}, this reduces to the standard covariance matrix.[23]
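The unbiased sample covariance formula matches NumPy's np.cov, which also uses the 1/(m-1) normalization by default (a minimal sketch on random data):

```python
import numpy as np

rng = np.random.default_rng(5)
m = 8
data = rng.standard_normal((m, 3))           # m observations of a 3-vector

xbar = data.mean(axis=0)                     # sample mean vector
centered = data - xbar
S_manual = centered.T @ centered / (m - 1)   # unbiased sample covariance matrix
S_numpy = np.cov(data, rowvar=False)         # np.cov defaults to ddof=1

print(np.allclose(S_manual, S_numpy))        # True
```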

Measures of Dependence

Correlation Coefficients

In multivariate statistics, the correlation coefficient standardizes the measure of linear dependence between pairs of random variables, providing a scale-free interpretation of their relationship. For components X_i and X_j of a random vector with covariance \sigma_{ij} and standard deviations \sigma_i and \sigma_j, the Pearson product-moment correlation coefficient is defined as

\rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}.

This scalar lies in the interval [-1, 1], where \rho_{ij} = 1 indicates a perfect positive linear relationship (i.e., X_j = aX_i + b with a > 0), \rho_{ij} = -1 indicates a perfect negative linear relationship (with a < 0), and values near 0 suggest weak or no linear association.[27][28] The full correlation matrix \mathbf{P} = (\rho_{ij}) assembles these pairwise coefficients, featuring 1s along the diagonal since each variable is perfectly correlated with itself. A key property is scale invariance: the entries of \mathbf{P} remain unchanged under affine transformations of the form X_i' = c_i X_i + d_i (with c_i > 0), unlike raw covariances which depend on the units of measurement. This invariance facilitates comparisons across heterogeneous variables, such as height in meters and weight in kilograms. Additionally, \mathbf{P} is symmetric and positive semi-definite, ensuring all eigenvalues are non-negative.[28] Partial correlation refines this measure by isolating the linear association between two variables while controlling for the effects of others, useful in multivariate settings to detect direct dependencies. The partial correlation \rho_{ij \cdot \mathbf{k}} between X_i and X_j given a set of controlling variables \mathbf{k} ranges in [-1, 1] and inherits scale invariance, with interpretation analogous to the simple case: it equals \pm 1 for perfect partial linear relations after adjustment.[29] In multiple regression, the multiple correlation coefficient R assesses the combined linear predictive power of several variables on a target, defined in the population as

R = \sqrt{\frac{\boldsymbol{\sigma}_{XY}^T \boldsymbol{\Sigma}_{XX}^{-1} \boldsymbol{\sigma}_{XY}}{\sigma_Y^2}},

where \boldsymbol{\sigma}_{XY} = \operatorname{Cov}(\mathbf{X}, Y) is the cross-covariance vector between the predictor vector \mathbf{X} and response Y, \boldsymbol{\Sigma}_{XX} is the covariance matrix of \mathbf{X}, and \sigma_Y^2 = \operatorname{Var}(Y); R thus ranges in [0, 1], with R^2 indicating the explained variance proportion.[30] As a non-linear alternative, Spearman's rank correlation coefficient evaluates monotonic rather than strictly linear dependencies, computed via Pearson correlation on ranked data to handle ordinal scales or outliers robustly.
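In matrix form the correlation matrix is obtained from a covariance matrix entry-wise via \rho_{ij} = \sigma_{ij}/(\sigma_i\sigma_j); a minimal NumPy sketch with an illustrative covariance matrix:

```python
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 9.0, -2.0],
                  [0.0, -2.0, 1.0]])      # a covariance matrix (illustrative)

d = np.sqrt(np.diag(Sigma))               # component standard deviations
P = Sigma / np.outer(d, d)                # rho_ij = sigma_ij / (sigma_i sigma_j)

print(np.diag(P))                         # ones on the diagonal
print(P[0, 1])                            # 2 / (2 * 3) = 1/3
```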

Uncorrelatedness and Orthogonality

In the context of a multivariate random vector \mathbf{X} = (X_1, \dots, X_n)^\top, the components are said to be uncorrelated if \operatorname{Cov}(X_i, X_j) = 0 for all i \neq j. This condition is equivalent to E[X_i X_j] = E[X_i] E[X_j] for i \neq j, assuming the expectations exist.[31][32] A key property of uncorrelated components is that the covariance matrix \Sigma = \operatorname{Cov}(\mathbf{X}) is diagonal, with the variances \operatorname{Var}(X_i) along the main diagonal. Additionally, for uncorrelated X_1, \dots, X_n, the variance of their sum satisfies \operatorname{Var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \operatorname{Var}(X_i), which simplifies computations in linear combinations and central limit theorem applications.[18][33] Orthogonality of random variables X_i and X_j (with i \neq j) is defined by E[X_i X_j] = 0, without requiring zero means, distinguishing it from uncorrelatedness which centers the variables first. This concept is particularly valuable in signal processing, where orthogonal signals or noise components facilitate efficient decomposition and filtering in Hilbert spaces of random variables.[34][35] An illustrative example arises in expansions using orthogonal polynomials, such as polynomial chaos expansions, where basis polynomials are chosen to be orthogonal with respect to the inner product defined by the expectation over a probability measure, enabling efficient representation of multivariate random functions.[36] Independence implies uncorrelatedness but is a stronger condition involving the full joint distribution.[37]
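The additivity of variances for uncorrelated components can be checked on simulated data with independent (hence uncorrelated) components of different scales (a sketch; the scales and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

# Three independent (hence uncorrelated) components with scales 1, 2, 0.5
X = rng.standard_normal((n, 3)) * np.array([1.0, 2.0, 0.5])

var_sum = X.sum(axis=1).var(ddof=1)       # Var(X1 + X2 + X3)
sum_var = X.var(axis=0, ddof=1).sum()     # Var(X1) + Var(X2) + Var(X3)

print(var_sum, sum_var)                   # both near 1 + 4 + 0.25 = 5.25
```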

Independence and Conditional Concepts

Joint Independence

In probability theory, two vector-valued random variables \mathbf{X} = (X_1, \dots, X_m) and \mathbf{Y} = (Y_1, \dots, Y_n) defined on the same probability space are said to be independent if, for every pair of Borel sets A \subseteq \mathbb{R}^m and B \subseteq \mathbb{R}^n, the joint probability satisfies P(\mathbf{X} \in A, \mathbf{Y} \in B) = P(\mathbf{X} \in A) P(\mathbf{Y} \in B).[4] This condition is equivalent to the joint cumulative distribution function (CDF) factorizing as F_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) = F_{\mathbf{X}}(\mathbf{x}) F_{\mathbf{Y}}(\mathbf{y}) for all \mathbf{x} \in \mathbb{R}^m, \mathbf{y} \in \mathbb{R}^n.[38] For discrete random vectors, this corresponds to the joint probability mass function (PMF) factorizing as p_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) = p_{\mathbf{X}}(\mathbf{x}) p_{\mathbf{Y}}(\mathbf{y}); for continuous random vectors, the joint probability density function (PDF), if it exists, satisfies f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) = f_{\mathbf{X}}(\mathbf{x}) f_{\mathbf{Y}}(\mathbf{y}).[4] The joint distribution under independence is thus the product measure of the marginal distributions of \mathbf{X} and \mathbf{Y}.[39] Joint independence implies that every component X_i is independent of every component Y_j for all i = 1, \dots, m and j = 1, \dots, n.[4] Subsets of components from \mathbf{X} and \mathbf{Y} also remain independent under this condition.[4] Independence is symmetric, meaning if \mathbf{X} is independent of \mathbf{Y}, then \mathbf{Y} is independent of \mathbf{X}.[38] A key property is that joint independence implies uncorrelatedness: the covariance between any X_i and Y_j is zero, so \operatorname{Cov}(X_i, Y_j) = 0 for all i, j.[40] However, the converse does not hold in general, as zero covariance measures only linear dependence, whereas independence captures all forms of probabilistic dependence.[40] Additionally, independence is preserved under measurable functions: if \mathbf{X} and \mathbf{Y} are independent, then for any measurable functions g: \mathbb{R}^m \to \mathbb{R}^p and h: \mathbb{R}^n \to \mathbb{R}^q, the transformed vectors g(\mathbf{X}) and h(\mathbf{Y}) are also independent.[4] A classic example illustrates the product measure structure. Consider two independent continuous uniform random variables U \sim \text{Uniform}[0,1] and V \sim \text{Uniform}[0,1]; their joint PDF is f_{U,V}(u,v) = 1 for 0 \leq u, v \leq 1, which is the product of the marginal PDFs f_U(u) = 1 and f_V(v) = 1 on the unit square.[41] This uniform distribution on the product space [0,1] \times [0,1] exemplifies how independence leads to a rectangular support and separable density, contrasting with dependent cases where the support might be restricted or the density non-factorizable.[39]

Marginal and Conditional Distributions

In multivariate probability distributions, the marginal distribution of a subset of random variables is obtained by summing or integrating the joint distribution over the remaining variables, effectively ignoring their values to focus on the distribution of the subset alone. For a discrete multivariate random variable \mathbf{X} = (X_1, \dots, X_n) with joint probability mass function p(\mathbf{x}), the marginal distribution of the i-th component X_i is given by
p_{X_i}(x_i) = \sum_{\{x_j : j \neq i\}} p(\mathbf{x}),
where the sum runs over all possible values of the other components x_j for j \neq i.[42] Similarly, for a continuous multivariate random variable with joint probability density function f(\mathbf{x}), the marginal density of X_i is
f_{X_i}(x_i) = \int f(\mathbf{x}) \, d\mathbf{x}_{-i},
with the integral taken over the variables \mathbf{x}_{-i}, i.e. all components except x_i.[43] These marginal distributions are themselves univariate (or multivariate if the subset contains more than one variable), providing the probability laws for the selected components irrespective of the others.[44] The conditional distribution describes the distribution of one subset of variables given fixed values of another subset, capturing the dependencies within the joint distribution. For continuous random variables partitioned as \mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2) with joint density f(\mathbf{x}_1, \mathbf{x}_2), the conditional density of \mathbf{X}_1 given \mathbf{X}_2 = \mathbf{x}_2 is
f(\mathbf{x}_1 \mid \mathbf{x}_2) = \frac{f(\mathbf{x}_1, \mathbf{x}_2)}{f(\mathbf{x}_2)},
provided f(\mathbf{x}_2) > 0, where f(\mathbf{x}_2) is the marginal density of \mathbf{X}_2.[45] In the discrete case, the conditional probability mass function is defined analogously as p(\mathbf{x}_1 \mid \mathbf{x}_2) = p(\mathbf{x}_1, \mathbf{x}_2) / p(\mathbf{x}_2).[14] This normalization ensures that the conditional distribution integrates or sums to 1 over \mathbf{x}_1, preserving the axioms of probability while conditioning on observed or fixed values of \mathbf{X}_2.[43] A key property of these distributions is their role in simplifying multivariate analyses: marginals reduce dimensionality for the study of individual components, while conditionals enable inference about one set of variables informed by another. For instance, in the multivariate normal distribution with mean vector \boldsymbol{\mu} and covariance matrix \boldsymbol{\Sigma}, the marginal distribution of each component is univariate normal, specifically X_i \sim \mathcal{N}(\mu_i, \Sigma_{ii}).[44] If the variables are independent, the conditional distributions coincide with the corresponding marginals.[45]
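The discrete formulas above can be checked directly on a small table. This sketch (the joint PMF values are illustrative assumptions) computes a marginal by summing out one variable and forms the conditional PMF by dividing by that marginal, confirming it normalizes to 1:

```python
import numpy as np

# Hypothetical 2x3 joint PMF for a dependent pair (X1, X2); entries sum to 1.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.05, 0.30]])

# Marginal of X2: sum the joint PMF over x1 (the "other" variable).
p_x2 = joint.sum(axis=0)

# Conditional PMF p(x1 | x2) = p(x1, x2) / p(x2), defined where p(x2) > 0.
cond = joint / p_x2              # broadcasting divides each column by p_x2

# Each conditional distribution sums to 1 over x1 (normalization property).
col_sums = cond.sum(axis=0)
```

Multiplying the conditional back by the marginal recovers the joint PMF, which is exactly the discrete form of the definition.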

Transformations

Affine Transformations

An affine transformation of a multivariate random variable \mathbf{X} is defined as \mathbf{Y} = A \mathbf{X} + \mathbf{b}, where A is a constant matrix and \mathbf{b} is a constant vector.[4] This generalizes linear transformations by incorporating a location shift via \mathbf{b}. The expected value of \mathbf{Y} transforms linearly as E[\mathbf{Y}] = A E[\mathbf{X}] + \mathbf{b}, reflecting the linearity of expectation extended to vectors.[4] Similarly, the covariance matrix updates according to \operatorname{Cov}(\mathbf{Y}) = A \operatorname{Cov}(\mathbf{X}) A^T, which preserves positive semi-definiteness while accounting for the scaling and rotation induced by A.[4] In general, the distribution of \mathbf{Y} is the pushforward measure of the distribution of \mathbf{X} under the affine map, altering the support and density accordingly. For instance, if \mathbf{X} follows a multivariate normal distribution \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}), then \mathbf{Y} also follows a multivariate normal \mathcal{N}(A \boldsymbol{\mu} + \mathbf{b}, A \boldsymbol{\Sigma} A^T), demonstrating closure under affine transformations.[46] A key application is standardization, which centers and scales \mathbf{X} to have mean zero and identity covariance: \mathbf{Z} = \boldsymbol{\Sigma}^{-1/2} (\mathbf{X} - \boldsymbol{\mu}), where \boldsymbol{\Sigma} is positive definite, yielding \mathbf{Z} \sim \mathcal{N}(\mathbf{0}, I).[47] This transformation facilitates comparisons across dimensions by removing location and scale effects.
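The moment formulas E[\mathbf{Y}] = A\boldsymbol{\mu} + \mathbf{b} and \operatorname{Cov}(\mathbf{Y}) = A\boldsymbol{\Sigma}A^T can be checked by Monte Carlo simulation. In this sketch the particular \boldsymbol{\mu}, \boldsymbol{\Sigma}, A, and \mathbf{b} are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                     # assumed mean vector
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])     # assumed covariance matrix
A = np.array([[1.0, 2.0], [0.0, 3.0]])
b = np.array([5.0, -1.0])

# Draw samples of X ~ N(mu, Sigma) and apply the affine map Y = A X + b.
X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + b

# Theoretical moments of Y under the affine transformation.
mean_theory = A @ mu + b
cov_theory = A @ Sigma @ A.T

# Empirical moments should match up to Monte Carlo error.
mean_mc = Y.mean(axis=0)
cov_mc = np.cov(Y, rowvar=False)
```

With 200,000 samples the empirical mean and covariance agree with the closed forms to a few decimal places, illustrating the closure of the multivariate normal family under affine maps.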

Invertible Mappings

In the context of multivariate random variables, an invertible mapping refers to a transformation \mathbf{Y} = g(\mathbf{X}), where \mathbf{X} = (X_1, \dots, X_n)^\top is an n-dimensional random vector, g: \mathbb{R}^n \to \mathbb{R}^n is a differentiable and bijective function with a differentiable inverse g^{-1}, and \mathbf{Y} is the transformed random vector.[48] Such mappings preserve the dimensionality of the random vector, ensuring that \mathbf{Y} remains n-dimensional, and are fundamental for deriving the distribution of transformed variables from the original joint distribution.[10] For jointly continuous random vectors with joint probability density function (pdf) f_{\mathbf{X}}(\mathbf{x}), the pdf of \mathbf{Y} under an invertible mapping is given by the change-of-variables formula:
f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{X}}(g^{-1}(\mathbf{y})) \left| \det J_{g^{-1}}(\mathbf{y}) \right|,
where J_{g^{-1}}(\mathbf{y}) is the Jacobian matrix of the inverse transformation, whose (i,j)-th entry is \frac{\partial (g^{-1})_i}{\partial y_j}(\mathbf{y}), and \det J_{g^{-1}}(\mathbf{y}) is its determinant.[48] The formula arises from the requirement that probabilities be preserved under the transformation, with the absolute value of the Jacobian determinant measuring the local volume distortion.[49] Affine transformations are a special linear case of such mappings, in which the Jacobian is constant.[10] Invertible mappings play a key role in copula theory, where the dependence structure of a multivariate distribution is separated from its marginals via invertible transformations of the cumulative distribution functions (CDFs) to uniform marginals on [0,1]^n.[50] Specifically, Sklar's theorem expresses the joint CDF as F(\mathbf{x}) = C(F_1(x_1), \dots, F_n(x_n)), with copula C, and the corresponding density transformation incorporates the Jacobians of the marginal inverse CDFs to yield the copula density.[50] A classic example is the transformation from Cartesian to polar coordinates for a bivariate random vector (X, Y) with joint pdf f_{X,Y}(x,y), defined by R = \sqrt{X^2 + Y^2} and \Theta = \tan^{-1}(Y/X), with inverse X = R \cos \Theta, Y = R \sin \Theta. The Jacobian determinant of the inverse is r, so the joint pdf of (R, \Theta) is f_{R,\Theta}(r, \theta) = f_{X,Y}(r \cos \theta, r \sin \theta) \cdot r for r > 0 and \theta \in [0, 2\pi).[48] This transformation simplifies computations for rotationally symmetric distributions, such as the bivariate normal.[49]
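The polar-coordinate example can be verified numerically: for a standard bivariate normal, the transformed density f_{R,\Theta}(r,\theta) = f_{X,Y}(r\cos\theta, r\sin\theta)\,r must still integrate to 1, which confirms that the Jacobian factor r correctly accounts for the volume distortion. A minimal midpoint-rule check (grid resolution and truncation at r = 8 are arbitrary choices):

```python
import numpy as np

def f_polar(r, theta):
    # Change-of-variables density for the standard bivariate normal:
    # f_{X,Y}(x, y) = exp(-(x^2 + y^2)/2) / (2*pi), times the Jacobian r.
    x, y = r * np.cos(theta), r * np.sin(theta)
    return np.exp(-(x**2 + y**2) / 2) / (2 * np.pi) * r

# Midpoint-rule integration over r in [0, 8] and theta in [0, 2*pi);
# the tail mass beyond r = 8 is negligible (of order exp(-32)).
nr, nt = 1000, 1000
dr, dt = 8.0 / nr, 2 * np.pi / nt
r = (np.arange(nr) + 0.5) * dr
theta = (np.arange(nt) + 0.5) * dt
R, T = np.meshgrid(r, theta, indexing="ij")
total = f_polar(R, T).sum() * dr * dt
```

Dropping the factor r from `f_polar` makes the integral diverge from 1, which is a quick way to see why the Jacobian determinant is needed.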

Generating Functions

Characteristic Function

The characteristic function of a multivariate random vector \mathbf{X} \in \mathbb{R}^k is defined as
\phi_{\mathbf{X}}(\mathbf{t}) = E\left[e^{i \mathbf{t}^T \mathbf{X}}\right],
where i = \sqrt{-1} and \mathbf{t} \in \mathbb{R}^k.[51] If \mathbf{X} admits a joint probability density function f(\mathbf{x}), this expectation can be expressed as the Fourier transform
\phi_{\mathbf{X}}(\mathbf{t}) = \int_{\mathbb{R}^k} e^{i \mathbf{t}^T \mathbf{x}} f(\mathbf{x}) \, d\mathbf{x}.
This function exists for any distribution, as the complex exponential is bounded, and serves as a generating tool that encodes the full probabilistic structure of \mathbf{X}.[24] Key properties of the characteristic function include its continuity in \mathbf{t} and the normalization \phi_{\mathbf{X}}(\mathbf{0}) = 1, since e^{i \mathbf{0}^T \mathbf{X}} = 1.[51] Moreover, by the uniqueness theorem, two random vectors have the same distribution if and only if their characteristic functions coincide for all \mathbf{t}, enabling inversion to recover the distribution from \phi_{\mathbf{X}}(\mathbf{t}).[52] Assuming the moments of \mathbf{X} exist and are finite, they can be extracted via differentiation at the origin. Specifically, the mean vector is
E[\mathbf{X}] = -i \nabla_{\mathbf{t}} \phi_{\mathbf{X}}(\mathbf{0}),
where \nabla_{\mathbf{t}} denotes the gradient with respect to \mathbf{t}.[51] The elements of the covariance matrix \boldsymbol{\Sigma} are obtained from the second partial derivatives:
\operatorname{Cov}(X_j, X_k) = \left. -\frac{\partial^2 \phi_{\mathbf{X}}(\mathbf{t})}{\partial t_j \partial t_k} \right|_{\mathbf{t}=\mathbf{0}} - E[X_j] E[X_k],
with the full second-moment matrix given by E[\mathbf{X} \mathbf{X}^T] = -\mathbf{H}\phi_{\mathbf{X}}(\mathbf{0}), where \mathbf{H}\phi_{\mathbf{X}}(\mathbf{0}) is the Hessian matrix of \phi_{\mathbf{X}} at the origin.[52] A prominent example is the multivariate normal distribution with mean \boldsymbol{\mu} and covariance \boldsymbol{\Sigma}, whose characteristic function is
\phi_{\mathbf{X}}(\mathbf{t}) = \exp\left(i \mathbf{t}^T \boldsymbol{\mu} - \frac{1}{2} \mathbf{t}^T \boldsymbol{\Sigma} \mathbf{t}\right).
This closed form directly reflects the location and dispersion parameters, and differentiation confirms E[\mathbf{X}] = \boldsymbol{\mu} and \operatorname{Cov}(\mathbf{X}) = \boldsymbol{\Sigma}.[44]
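The closed form for the multivariate normal can be compared against a direct Monte Carlo estimate of E[e^{i\mathbf{t}^T\mathbf{X}}]. In this sketch the parameters \boldsymbol{\mu}, \boldsymbol{\Sigma}, and the evaluation point \mathbf{t} are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.5, -1.0])                     # assumed mean
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])     # assumed covariance
t = np.array([0.4, -0.2])                      # evaluation point

# Monte Carlo estimate of the characteristic function E[exp(i t^T X)].
X = rng.multivariate_normal(mu, Sigma, size=400_000)
phi_mc = np.mean(np.exp(1j * (X @ t)))

# Closed form for the multivariate normal characteristic function.
phi_theory = np.exp(1j * (t @ mu) - 0.5 * (t @ Sigma @ t))
```

Since |e^{i t^T x}| = 1 for every sample, the estimator is always well defined, mirroring the fact that the characteristic function exists for any distribution.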

Moment-Generating Function

The moment-generating function (MGF) of a multivariate random variable \mathbf{X} \in \mathbb{R}^p is defined as M_{\mathbf{X}}(\mathbf{t}) = E[e^{\mathbf{t}^T \mathbf{X}}], where \mathbf{t} \in \mathbb{R}^p.[53] Whenever the MGF exists in a neighborhood of the origin \mathbf{t} = \mathbf{0}, it is analytic there, allowing a power series expansion that facilitates the extraction of distributional properties.[54] The MGF relates directly to the moments and cumulants of \mathbf{X} through its partial derivatives evaluated at \mathbf{0}. Specifically, the mean vector is given by the gradient, \nabla M_{\mathbf{X}}(\mathbf{0}) = E[\mathbf{X}], while higher-order mixed partial derivatives yield the corresponding joint moments, such as E[X_i X_j] from the second partial derivative with respect to t_i and t_j at \mathbf{0}.[55] If the MGF exists in a neighborhood of the origin, it uniquely determines the distribution of \mathbf{X}.[56] A key property of the MGF concerns sums of independent random vectors: if \mathbf{X} and \mathbf{Y} are independent, then M_{\mathbf{X} + \mathbf{Y}}(\mathbf{t}) = M_{\mathbf{X}}(\mathbf{t}) M_{\mathbf{Y}}(\mathbf{t}), mirroring the convolution property of their densities.[57] For the multivariate normal distribution \mathbf{X} \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}), the MGF takes the explicit form
M_{\mathbf{X}}(\mathbf{t}) = \exp\left( \mathbf{t}^T \boldsymbol{\mu} + \frac{1}{2} \mathbf{t}^T \boldsymbol{\Sigma} \mathbf{t} \right),
which exists for all \mathbf{t} and confirms the distribution's moments via differentiation.[58]
However, the MGF does not exist for all multivariate distributions, particularly those with heavy tails lacking finite moments of all orders. For instance, the multivariate Cauchy distribution has no MGF, as its heavy-tailed marginals make the defining expectation diverge.[59]
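The multiplicative property for independent sums can be checked exactly using the closed-form normal MGF: independent normals add in both mean and covariance, so the two sides agree algebraically. The parameter values below are illustrative assumptions:

```python
import numpy as np

def mgf_normal(t, mu, Sigma):
    # MGF of N(mu, Sigma): exp(t^T mu + 0.5 t^T Sigma t), finite for all t.
    return np.exp(t @ mu + 0.5 * (t @ Sigma @ t))

mu1, Sigma1 = np.array([1.0, 0.0]), np.array([[1.0, 0.2], [0.2, 1.5]])
mu2, Sigma2 = np.array([-0.5, 2.0]), np.array([[0.5, 0.0], [0.0, 0.7]])
t = np.array([0.3, -0.1])

# For independent X ~ N(mu1, Sigma1) and Y ~ N(mu2, Sigma2),
# X + Y ~ N(mu1 + mu2, Sigma1 + Sigma2), so the MGF of the sum
# must equal the product of the individual MGFs.
lhs = mgf_normal(t, mu1 + mu2, Sigma1 + Sigma2)
rhs = mgf_normal(t, mu1, Sigma1) * mgf_normal(t, mu2, Sigma2)
```

The identity holds because the exponents add, which is exactly the convolution property stated above.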

Advanced Properties

Quadratic Forms

A quadratic form of a multivariate random vector \mathbf{X} \in \mathbb{R}^p is defined as Q = \mathbf{X}^T A \mathbf{X}, where A is a p \times p symmetric matrix. This construction makes Q a scalar random variable that captures second-order interactions among the components of \mathbf{X}, and it plays a central role in multivariate analysis for tasks such as deriving test statistics and measuring deviations from a hypothesized structure. If A is not symmetric, it can be replaced by the symmetrized matrix (A + A^T)/2 without altering the value of Q.[60] For \mathbf{X} \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}), the moments of Q admit closed-form expressions. The expectation is
E[Q] = \operatorname{tr}(A \boldsymbol{\Sigma}) + \boldsymbol{\mu}^T A \boldsymbol{\mu},
where \operatorname{tr}(\cdot) denotes the matrix trace. The variance follows as
\operatorname{Var}(Q) = 2 \operatorname{tr}(A \boldsymbol{\Sigma} A \boldsymbol{\Sigma}) + 4 \boldsymbol{\mu}^T A \boldsymbol{\Sigma} A \boldsymbol{\mu}.
These results derive from the properties of multivariate normal moments and can be obtained using the expectation of products of quadratic forms.[61] A prominent example is the squared Mahalanobis distance, D^2 = (\mathbf{X} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{X} - \boldsymbol{\mu}), which corresponds to the choice A = \boldsymbol{\Sigma}^{-1} applied to the centered vector. Under multivariate normality, D^2 \sim \chi^2_p, a central chi-squared distribution with p degrees of freedom, providing a scale-free measure of multivariate dispersion. This distribution facilitates probabilistic statements, such as prediction regions defined by D^2 \leq \chi^2_{p, 1-\alpha}.[62] Quadratic forms also underlie the Wishart distribution, which describes the sampling variability of covariance matrices. If \mathbf{X}_1, \dots, \mathbf{X}_n \overset{\text{i.i.d.}}{\sim} N_p(\mathbf{0}, \boldsymbol{\Sigma}), then the random matrix W = \sum_{i=1}^n \mathbf{X}_i \mathbf{X}_i^T follows a Wishart distribution W_p(n, \boldsymbol{\Sigma}), generalizing the chi-squared distribution to matrix-valued sums of outer products. Its mean is E[W] = n \boldsymbol{\Sigma}, and it is the exact distribution of the unnormalized sample covariance matrix in centered normal samples.[63]
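The moment formula E[Q] = \operatorname{tr}(A\boldsymbol{\Sigma}) + \boldsymbol{\mu}^T A \boldsymbol{\mu} and the \chi^2_p mean of the squared Mahalanobis distance can both be checked by simulation. The 3-dimensional parameters below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, 2.0, -1.0])                      # assumed mean
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])                  # assumed covariance
A = np.array([[1.0, 0.2, 0.0],
              [0.2, 2.0, -0.1],
              [0.0, -0.1, 0.5]])                     # symmetric matrix

X = rng.multivariate_normal(mu, Sigma, size=300_000)

# Q = x^T A x evaluated per sample via einsum.
Q = np.einsum("ni,ij,nj->n", X, A, X)

eq_theory = np.trace(A @ Sigma) + mu @ A @ mu
eq_mc = Q.mean()

# Squared Mahalanobis distance; under normality D^2 ~ chi-squared with
# p = 3 degrees of freedom, so its mean should be close to 3.
D2 = np.einsum("ni,ij,nj->n", X - mu, np.linalg.inv(Sigma), X - mu)
```

The empirical mean of `Q` matches the trace formula, and `D2.mean()` is close to p = 3, consistent with the chi-squared law.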

Isserlis' Theorem for Higher Moments

Isserlis' theorem provides a method for computing the higher-order moments of a multivariate normal random vector, particularly the expectation of the product of its components. For a centered (zero-mean) multivariate normal random vector \mathbf{X} = (X_1, \dots, X_n)^\top \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}), the theorem states that the expectation of a product with an odd number of factors vanishes: E\left[\prod_{k=1}^{2m+1} X_{i_k}\right] = 0. For even orders 2m, the expectation is the sum over all perfect matchings (pairings) of the indices, where each pairing contributes the product of the corresponding covariances:
E\left[\prod_{k=1}^{2m} X_{i_k}\right] = \sum_{\pi} \prod_{(j,l) \in \pi} \operatorname{Cov}(X_{i_j}, X_{i_l}),
with the sum taken over all (2m-1)!! = (2m)!/(2^m m!) distinct pairings \pi.[64][65] In the general non-centered case, where \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) with \boldsymbol{\mu} \neq \mathbf{0}, Isserlis' theorem extends to sums over all partitions of the index set into singletons and pairs, each partition contributing the product of the means for its singletons and the covariances for its pairs. This generalization, derived by Withers, allows computation of arbitrary moments by accounting for the location parameter \boldsymbol{\mu}.[66] The theorem was originally derived by Leon Isserlis in 1918 as a formula for product-moment coefficients of normal distributions. In physics, it is known as Wick's theorem, introduced by Gian-Carlo Wick in 1950 for evaluating time-ordered products of Gaussian fields, and it is applied in quantum field theory to compute Feynman diagrams.[65] Isserlis' theorem facilitates the evaluation of multivariate Gaussian integrals and the determination of moments beyond second order, which are otherwise intractable due to the complexity of the joint density. These applications are essential in statistical physics, signal processing, and the derivation of higher cumulants for asymptotic analyses.[67][66] For illustration, consider four components X_1, X_2, X_3, X_4 of a centered multivariate normal \mathbf{X}. The fourth moment is
E[X_1 X_2 X_3 X_4] = \operatorname{Cov}(X_1, X_2) \operatorname{Cov}(X_3, X_4) + \operatorname{Cov}(X_1, X_3) \operatorname{Cov}(X_2, X_4) + \operatorname{Cov}(X_1, X_4) \operatorname{Cov}(X_2, X_3),
corresponding to the three possible pairings.[64]
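The three-pairing identity for the fourth moment can be verified by simulation. The covariance matrix below is an arbitrary positive-definite illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[1.0, 0.5, 0.2, 0.0],
                  [0.5, 2.0, 0.3, 0.4],
                  [0.2, 0.3, 1.5, 0.1],
                  [0.0, 0.4, 0.1, 1.0]])   # assumed covariance matrix

# Centered multivariate normal samples.
X = rng.multivariate_normal(np.zeros(4), Sigma, size=1_000_000)

# Isserlis / Wick: E[X1 X2 X3 X4] is the sum over the three pairings
# (1,2)(3,4), (1,3)(2,4), and (1,4)(2,3) of products of covariances.
m4_theory = (Sigma[0, 1] * Sigma[2, 3]
             + Sigma[0, 2] * Sigma[1, 3]
             + Sigma[0, 3] * Sigma[1, 2])
m4_mc = (X[:, 0] * X[:, 1] * X[:, 2] * X[:, 3]).mean()
```

The empirical fourth moment agrees with the pairing sum up to Monte Carlo error; odd-order products of the same samples average to approximately zero, as the theorem predicts.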

Applications

Portfolio Theory

In portfolio theory, the returns of multiple assets are modeled as a multivariate random vector \mathbf{R} = (R_1, \dots, R_n)^T, where each R_i denotes the random return of the i-th asset, capturing their joint probabilistic behavior, including dependencies, via the covariance structure.[68] This setup allows portfolio performance under uncertainty to be analyzed by treating asset returns as realizations from a joint distribution. A portfolio is formed by allocating weights \mathbf{w} = (w_1, \dots, w_n)^T to these assets, yielding the portfolio return R_p = \mathbf{w}^T \mathbf{R}. The expected portfolio return is E[R_p] = \mathbf{w}^T \boldsymbol{\mu}, where \boldsymbol{\mu} = E[\mathbf{R}] is the vector of individual expected returns. The portfolio variance, a key measure of risk, is given by
\operatorname{Var}(R_p) = \mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w},
where \boldsymbol{\Sigma} is the covariance matrix of \mathbf{R}. This quadratic form shows how correlations between assets influence overall risk, enabling diversification benefits when covariances are low.[69] The foundational Markowitz model, introduced in 1952, formalizes portfolio optimization by minimizing variance subject to a target expected return \mathbf{w}^T \boldsymbol{\mu} = \mu_p and the full-investment constraint \mathbf{w}^T \mathbf{1} = 1, where \mathbf{1} is a vector of ones. This mean-variance optimization problem is solved via quadratic programming, yielding the set of portfolios offering the highest return for a given risk level or the lowest risk for a given return. The resulting collection of optimal portfolios traces the efficient frontier in the mean-variance plane, a hyperbolic curve of undominated choices. Markowitz's approach revolutionized asset allocation by quantifying the risk-return tradeoff through multivariate moments.[69] For illustration, consider a two-asset portfolio with returns R_1 and R_2, weights w_1 + w_2 = 1, expected returns \mu_1 and \mu_2, variances \sigma_1^2 and \sigma_2^2, and correlation \rho. The portfolio variance simplifies to \sigma_p^2 = w_1^2 \sigma_1^2 + w_2^2 \sigma_2^2 + 2 w_1 w_2 \rho \sigma_1 \sigma_2. When \rho < 1, diversification reduces \sigma_p below the weighted average of the individual standard deviations, demonstrating how low or negative correlations lower risk without sacrificing return potential; for \rho = 1, no diversification benefit exists. This case underscores the multivariate nature of risk in Markowitz's framework.[69]
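The two-asset case can be worked through numerically. In this sketch all return, volatility, and correlation figures are illustrative assumptions:

```python
import numpy as np

# Hypothetical two-asset example: expected returns, volatilities, correlation.
mu = np.array([0.08, 0.12])       # expected returns (assumed)
sigma = np.array([0.15, 0.25])    # standard deviations (assumed)
rho = 0.2                         # correlation (assumed)
Sigma = np.array([[sigma[0]**2, rho * sigma[0] * sigma[1]],
                  [rho * sigma[0] * sigma[1], sigma[1]**2]])

w = np.array([0.6, 0.4])          # portfolio weights, summing to 1

port_mean = w @ mu                # E[R_p] = w^T mu
port_var = w @ Sigma @ w          # Var(R_p) = w^T Sigma w
port_vol = np.sqrt(port_var)

# With rho < 1, diversification pushes the portfolio volatility below the
# weighted average of the individual volatilities.
weighted_avg_vol = w @ sigma
```

Here the portfolio volatility comes out below the weighted average of the two asset volatilities, the diversification benefit described above; setting `rho = 1.0` makes the two quantities coincide.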

Multivariate Regression

Multivariate regression, also known as multivariate multiple regression, extends the linear regression model to settings with multiple response variables and multiple predictor variables, allowing correlated outcomes to be modeled jointly.[70] The core model is expressed in matrix form as \mathbf{Y} = X \mathbf{B} + \mathbf{E}, where \mathbf{Y} is an n \times q matrix of responses with n observations and q response variables, X is an n \times p design matrix of predictors (including an intercept column), \mathbf{B} is a p \times q coefficient matrix, and \mathbf{E} is an n \times q error matrix whose rows are independent and multivariate normal, \boldsymbol{\epsilon}_i \sim \mathcal{N}_q(\mathbf{0}, \boldsymbol{\Sigma}), with \boldsymbol{\Sigma} the q \times q covariance matrix of the errors.[70] This setup accounts for correlations among the responses, providing more efficient inferences than fitting separate univariate regressions.[71] The coefficient matrix \mathbf{B} is typically estimated by ordinary least squares (OLS), yielding \hat{\mathbf{B}} = (X^T X)^{-1} X^T \mathbf{Y}, which minimizes the sum of squared residuals in the Frobenius norm and coincides with the maximum likelihood estimator under multivariate normality.[70] Because every response shares the same design matrix X, these point estimates are identical whether or not \boldsymbol{\Sigma} has off-diagonal correlations, so generalized least squares offers no improvement in this setting.[70] The covariance matrix of \operatorname{vec}(\hat{\mathbf{B}}) is estimated as \hat{\boldsymbol{\Sigma}} \otimes (X^T X)^{-1}, where \hat{\boldsymbol{\Sigma}} = \frac{1}{n-p} \hat{\mathbf{E}}^T \hat{\mathbf{E}} is the unbiased estimator of the error covariance matrix based on the residuals \hat{\mathbf{E}} = \mathbf{Y} - X \hat{\mathbf{B}}.[70] Hypothesis testing in multivariate regression often employs multivariate analysis of variance (MANOVA) techniques to assess whether subsets of coefficients in \mathbf{B} differ from specified values, such as testing overall model significance or group differences in predictors.[71] A key test statistic is Wilks' lambda, \Lambda = |\tilde{\boldsymbol{\Sigma}}| / |\tilde{\boldsymbol{\Sigma}}_1|, where \tilde{\boldsymbol{\Sigma}} is the maximum likelihood estimate of \boldsymbol{\Sigma} under the full model and \tilde{\boldsymbol{\Sigma}}_1 the estimate under the null hypothesis (e.g., that certain rows of \mathbf{B} are zero); small values of \Lambda lead to rejection of the null.[70] Under the null hypothesis, Bartlett's approximation gives that -\nu \log \Lambda is approximately chi-squared distributed, where \nu is a correction factor depending on n and on the dimensions of the tested hypothesis; this likelihood ratio test provides a unified framework for multivariate hypotheses.[70] Wilks' lambda was introduced by Samuel S. Wilks in 1932 as a criterion for testing equality of multivariate means, and was later adapted to regression contexts.[72] A practical example involves predicting sales across multiple channels (e.g., online, retail, and wholesale) as response variables from advertising expenditures in various media (e.g., TV, digital, print) as predictors; the model captures how ad spending jointly influences correlated sales outcomes, with MANOVA testing whether the coefficients differ significantly from zero across channels.[73] The foundations of multivariate regression trace back to the 1930s, with Harold Hotelling's extensions of univariate methods to multivariate settings, including his 1936 work on relations between sets of variates, which laid the groundwork for joint modeling of multiple responses and predictors.[74]
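The estimation steps can be sketched on simulated data. This minimal example (all dimensions, coefficients, and the error covariance are illustrative assumptions) fits \hat{\mathbf{B}} by least squares and forms the unbiased residual-based estimate of \boldsymbol{\Sigma}:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 200, 3, 2                  # observations, predictors, responses

# Simulated design matrix (with intercept column) and a known coefficient
# matrix B_true; both are arbitrary illustrative choices.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
B_true = np.array([[1.0, -2.0],
                   [0.5, 1.5],
                   [-1.0, 0.3]])
Sigma = np.array([[1.0, 0.4], [0.4, 0.8]])   # correlated errors across responses
E = rng.multivariate_normal(np.zeros(q), Sigma, size=n)
Y = X @ B_true + E

# OLS / ML estimate: B_hat = (X^T X)^{-1} X^T Y, solved via lstsq for stability.
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Unbiased estimate of the error covariance from the residual matrix.
resid = Y - X @ B_hat
Sigma_hat = resid.T @ resid / (n - p)

# Normal equations: residuals are orthogonal to the column space of X.
normal_eq = X.T @ resid
```

Fitting each response column separately with univariate OLS would give the same `B_hat`, reflecting the shared-design property noted above; the gain from the multivariate model lies in the joint error covariance `Sigma_hat` used for inference.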

References
