Linear model

from Wikipedia

In statistics, the term linear model refers to any model which assumes linearity in the system. The most common occurrence is in connection with regression models and the term is often taken as synonymous with linear regression model. However, the term is also used in time series analysis with a different meaning. In each case, the designation "linear" is used to identify a subclass of models for which substantial reduction in the complexity of the related statistical theory is possible.

Linear regression models


For the regression case, the statistical model is as follows. Given a (random) sample $(Y_i, X_{i1}, \ldots, X_{ip}),\ i = 1, \ldots, n$, the relation between the observations $Y_i$ and the independent variables $X_{ij}$ is formulated as

$Y_i = \beta_0 + \beta_1 \phi_1(X_{i1}) + \cdots + \beta_p \phi_p(X_{ip}) + \varepsilon_i, \qquad i = 1, \ldots, n,$

where $\phi_1, \ldots, \phi_p$ may be nonlinear functions. In the above, the quantities $\varepsilon_i$ are random variables representing errors in the relationship. The "linear" part of the designation relates to the appearance of the regression coefficients $\beta_j$ in a linear way in the above relationship. Alternatively, one may say that the predicted values corresponding to the above model, namely

$\hat{Y}_i = \beta_0 + \beta_1 \phi_1(X_{i1}) + \cdots + \beta_p \phi_p(X_{ip}), \qquad i = 1, \ldots, n,$

are linear functions of the $\beta_j$.

Given that estimation is undertaken on the basis of a least squares analysis, estimates of the unknown parameters $\beta_j$ are determined by minimising a sum of squares function

$S = \sum_{i=1}^{n} \left( Y_i - \beta_0 - \beta_1 \phi_1(X_{i1}) - \cdots - \beta_p \phi_p(X_{ip}) \right)^2.$

From this, it can readily be seen that the "linear" aspect of the model means the following (a numerical sketch follows this list):

  • the function to be minimised is a quadratic function of the $\beta_j$, for which minimisation is a relatively simple problem;
  • the derivatives of the function are linear functions of the $\beta_j$, making it easy to find the minimising values;
  • the minimising values $\hat{\beta}_j$ are linear functions of the observations $Y_i$;
  • the minimising values $\hat{\beta}_j$ are linear functions of the random errors $\varepsilon_i$, which makes it relatively easy to determine the statistical properties of the estimated values of $\beta_j$.
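
As a brief numerical sketch of this least-squares minimisation (using hypothetical simulated data and, purely for illustration, the basis functions phi_1(x) = x and phi_2(x) = x^2), the estimates can be obtained directly in Python:

    import numpy as np

    # Hypothetical sample generated from Y = 1 + 2*phi_1(x) + 0.5*phi_2(x) + error,
    # with phi_1(x) = x and phi_2(x) = x^2 chosen for illustration.
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 5.0, 50)
    y = 1 + 2 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)

    # The model is linear in the coefficients even though phi_2 is nonlinear in x.
    X = np.column_stack([np.ones_like(x), x, x**2])

    # Minimise the sum of squared residuals S(beta) = ||y - X beta||^2.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_hat)  # roughly [1, 2, 0.5]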

Time series models


An example of a linear time series model is an autoregressive moving average model. Here the model for values $\{X_t\}$ in a time series can be written in the form

$X_t = c + \varepsilon_t + \sum_{i=1}^{p} \phi_i X_{t-i} + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i},$

where again the quantities $\varepsilon_t$ are random variables representing innovations, which are new random effects that appear at a certain time but also affect values of $X$ at later times. In this instance the use of the term "linear model" refers to the structure of the above relationship in representing $X_t$ as a linear function of past values of the same time series and of current and past values of the innovations.[1] This particular aspect of the structure means that it is relatively simple to derive relations for the mean and covariance properties of the time series. Note that here the "linear" part of the term "linear model" is not referring to the coefficients $\phi_i$ and $\theta_i$, as it would be in the case of a regression model, which looks structurally similar.
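
A small sketch of such a model, assuming the statsmodels library and simulated data (the ARMA(1, 1) orders and coefficients below are arbitrary choices for illustration):

    import numpy as np
    from statsmodels.tsa.arima_process import ArmaProcess
    from statsmodels.tsa.arima.model import ARIMA

    # Simulate X_t = 0.6 X_{t-1} + eps_t + 0.3 eps_{t-1}; ArmaProcess expects
    # lag-polynomial coefficients, hence the signs below.
    ar = np.array([1.0, -0.6])   # 1 - 0.6 L
    ma = np.array([1.0, 0.3])    # 1 + 0.3 L
    x = ArmaProcess(ar, ma).generate_sample(nsample=500)

    # Fit an ARMA(1, 1) model; order=(p, d, q) with d=0 means no differencing.
    result = ARIMA(x, order=(1, 0, 1)).fit()
    print(result.params)  # estimated AR and MA coefficients and innovation variance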

Other uses in statistics


There are some other instances where "nonlinear model" is used to contrast with a linearly structured model, although the term "linear model" is not usually applied. One example of this is nonlinear dimensionality reduction.

from Grokipedia
A linear model in statistics is a framework for modeling the relationship between a response variable and one or more predictor variables, assuming that the expected value of the response is a linear function of the predictors, typically expressed in matrix form as $Y = X\beta + \epsilon$, where $Y$ is the $n \times 1$ vector of observed responses, $X$ is the $n \times p$ design matrix incorporating the predictors, $\beta$ is the $p \times 1$ vector of unknown parameters, and $\epsilon$ is the $n \times 1$ vector of random errors with mean zero. The origins of linear models trace back to the late 19th century, when Sir Francis Galton developed the concept of regression while studying hereditary traits in sweet peas, introducing the idea of a linear relationship tending toward the mean, which Karl Pearson later formalized in the early 20th century through the development of the product-moment correlation and multiple regression techniques. Linear models encompass a wide range of applications, including simple and multiple linear regression for prediction and inference, analysis of variance (ANOVA) for comparing group means, and analysis of covariance (ANCOVA) for adjusting means across covariates. Under the Gauss-Markov assumptions—where errors have zero mean, constant variance $\sigma^2$, and are uncorrelated—the ordinary least squares estimator of $\beta$ is the best linear unbiased estimator (BLUE), providing efficient parameter estimates via the solution to the normal equations $X'X\hat{\beta} = X'Y$. Extensions include generalized least squares for heteroscedastic or correlated errors, as in the Aitken model where $\operatorname{cov}(\epsilon) = \sigma^2 V$ with known $V$, and further generalizations to linear mixed models incorporating random effects for clustered or hierarchical data. These models are foundational in fields like economics, biology, and social sciences, enabling hypothesis testing via F-statistics and confidence intervals under normality assumptions.
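
As an illustrative sketch of the Aitken-type extension mentioned above (hypothetical simulated data; the AR(1)-style correlation matrix V is assumed known here only for the example), generalized least squares can be fit with statsmodels:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical data with error covariance sigma^2 * V, with V known (Aitken model).
    rng = np.random.default_rng(1)
    n = 100
    X = sm.add_constant(rng.normal(size=(n, 2)))
    V = 0.5 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))  # AR(1)-style correlation
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.multivariate_normal(np.zeros(n), V)

    # GLS weights the fit by the known covariance structure V.
    gls_fit = sm.GLS(y, X, sigma=V).fit()
    print(gls_fit.params)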

Basic Concepts

Definition and Scope

In statistics, a linear model describes the relationship between a dependent variable and one or more independent variables as a linear function of the parameters, meaning the expected value of the dependent variable is a linear combination of the independent variables weighted by unknown coefficients. This linearity pertains specifically to the parameters rather than the variables themselves, allowing transformations of the variables (such as logarithms or polynomials) to maintain the linear structure in the coefficients. A basic example is the simple linear regression model, where the dependent variable $Y$ is modeled as $Y = \beta_0 + \beta_1 X + \epsilon$, with $\beta_0$ and $\beta_1$ as the intercept and slope parameters, $X$ as the independent variable, and $\epsilon$ as a random error term representing unexplained variation. This formulation emphasizes the additivity of effects, where the influence of each independent variable contributes independently to the outcome, aligning with the superposition principle that permits combining solutions through scaling and addition. The scope of linear models encompasses a broad range of applications in statistics, including regression analysis, analysis of variance, and experimental design, primarily for predicting outcomes and drawing inferences about relationships in fields such as economics, biology, and the social sciences. Unlike nonlinear models, where conditional and marginal effects may diverge, linear models ensure these effects coincide, simplifying interpretation and enabling properties like homogeneity and additivity that support scalable solutions. A key advantage is that linearity in parameters facilitates closed-form solutions for parameter estimation, making linear models computationally efficient and analytically tractable compared to nonlinear alternatives requiring iterative methods.
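
A minimal sketch of the simple linear regression model above, using hypothetical simulated data and the closed-form least-squares formulas for the slope and intercept:

    import numpy as np

    # Hypothetical data generated from Y = 2 + 3X + error
    rng = np.random.default_rng(42)
    x = rng.uniform(0.0, 10.0, size=200)
    y = 2 + 3 * x + rng.normal(scale=1.0, size=x.size)

    # Closed-form estimates: slope = cov(x, y) / var(x), intercept from the means
    beta1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    beta0 = y.mean() - beta1 * x.mean()
    print(beta0, beta1)  # approximately 2 and 3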

Historical Background

The development of linear models traces its roots to the early 19th century, when astronomers and mathematicians sought methods to fit observational data amid measurement errors. In 1805, Adrien-Marie Legendre introduced the method of least squares in his work Nouvelles méthodes pour la détermination des orbites des comètes, applying it to minimize the sum of squared residuals for predicting comet orbits based on astronomical observations. This deterministic approach marked a foundational step in handling overdetermined systems. Four years later, in 1809, Carl Friedrich Gauss published Theoria motus corporum coelestium in sectionibus conicis solem ambientium, where he claimed prior use of least squares since 1795 and provided a probabilistic justification by linking it to the normal distribution of errors, establishing it as a maximum-likelihood estimator under Gaussian assumptions.

The concept of regression emerged in the late 19th century through studies of inheritance patterns. In 1886, Francis Galton coined the term "regression" in his paper "Regression Towards Mediocrity in Hereditary Stature," published in the Journal of the Anthropological Institute, while analyzing height data from parents and children. Galton observed that extreme parental heights tended to produce offspring heights closer to the population average, introducing the idea of linear relationships between variables and laying the groundwork for bivariate regression as a tool in biometrics. This work shifted focus from mere curve fitting to modeling dependencies, influencing subsequent statistical applications in the natural sciences.

Building on Galton's ideas, Karl Pearson formalized key aspects of correlation and regression in the late 19th and early 20th centuries. In 1895, Pearson developed the product-moment correlation coefficient to quantify the strength of linear relationships between variables. He further extended this to multiple regression techniques around 1900–1910, enabling the modeling of a dependent variable against several predictors, which provided a mathematical foundation for broader applications in biometrics and beyond.

In the 1920s, Ronald A. Fisher advanced linear models into a unified framework for experimental design and analysis. Working at the Rothamsted Experimental Station, Fisher developed analysis of variance (ANOVA) in his 1925 book Statistical Methods for Research Workers, extending least squares methods to partition variance in designed experiments, such as agricultural trials. By the mid-1930s, in works like The Design of Experiments (1935), Fisher synthesized regression, ANOVA, and covariance analysis into the general linear model, incorporating probabilistic error terms to enable inference from sample data. This evolution transformed linear models from deterministic tools into probabilistic frameworks essential for hypothesis testing in the experimental sciences.

The 1930s also saw Jerzy Neyman and Egon Pearson formalize inference procedures for linear models through their Neyman-Pearson lemma, introduced in the 1933 paper "On the Problem of the Most Efficient Tests of Statistical Hypotheses" in Philosophical Transactions of the Royal Society. Their framework emphasized controlling error rates (Type I and Type II) and power in hypothesis testing, providing a rigorous basis for applying linear models to decision-making under uncertainty. Post-World War II computational advances, including early electronic computers like the ENIAC (1945) and statistical software developments in the 1950s, facilitated the widespread adoption of these methods by enabling efficient matrix computations for large datasets. This period marked the transition of linear models from theoretical constructs to practical tools in fields like economics and the social sciences.

Mathematical Formulation

General Linear Model Equation

The general linear model expresses the relationship between a response variable and one or more predictor variables as a linear function of parameters plus an error term. In its scalar form, for each $i = 1, \dots, n$, the model is given by $Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i$, where $Y_i$ is the observed response, $\beta_0$ is the intercept, $\beta_j$ (for $j = 1, \dots, p-1$) are the slope coefficients associated with the predictors $X_{ij}$, and $\varepsilon_i$ represents the random error for the $i$th observation. This formulation can be compactly represented in vector-matrix notation as $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\mathbf{Y}$ is an $n \times 1$ vector of responses, $\mathbf{X}$ is an $n \times p$ design matrix with rows corresponding to observations, $\boldsymbol{\beta}$ is a $p \times 1$ vector of parameters (including the intercept), and $\boldsymbol{\varepsilon}$ is an $n \times 1$ vector of errors. The errors $\varepsilon_i$ are typically assumed to be independently and identically distributed as normal with mean zero and constant variance $\sigma^2$, though full details on these assumptions appear in the relevant section.

In this model, each coefficient $\beta_j$ (for $j \geq 1$) represents the partial effect of the $j$th predictor on the response, interpreted as the expected change in $Y$ for a one-unit increase in $X_j$, holding all other predictors constant. The design matrix $\mathbf{X}$ structures the predictors for estimation; its first column consists entirely of 1s to accommodate the intercept term $\beta_0$. Categorical predictors are incorporated by creating dummy variables, where each category (except one reference category) is represented by a binary column in $\mathbf{X}$ to avoid multicollinearity.
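
A short sketch, assuming the statsmodels formula interface and a hypothetical dataset with one numeric predictor x and one three-level categorical predictor group, of how the design matrix (intercept column plus dummy columns) is built and the model fit:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data: numeric predictor x and categorical predictor group
    rng = np.random.default_rng(7)
    df = pd.DataFrame({
        "x": rng.normal(size=120),
        "group": rng.choice(["a", "b", "c"], size=120),
    })
    group_effect = {"a": 0.0, "b": 1.5, "c": -1.0}
    df["y"] = 2 + 3 * df["x"] + df["group"].map(group_effect) + rng.normal(size=120)

    # The formula builds the design matrix: a column of 1s for the intercept, the
    # x column, and dummy columns for groups "b" and "c" ("a" is the reference).
    fit = smf.ols("y ~ x + C(group)", data=df).fit()
    print(fit.params)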

Matrix Representation

The design matrix $X$ serves as the foundational structure in the matrix representation of linear models, typically an $n \times p$ matrix where $n$ denotes the number of observations and $p$ the number of parameters (including the intercept). Its rows represent individual observations, while its columns correspond to the predictor variables. For the model to be identifiable and to ensure unique parameter estimates, $X$ must have full column rank, meaning its rank equals $p$, which implies that the columns are linearly independent and there is no perfect multicollinearity. This full column rank condition guarantees the invertibility of the matrix $X^\top X$, a key property that enables the explicit solution for model parameters.

The parameter vector $\beta$, a $p \times 1$ column vector containing the coefficients, is estimated in matrix form via the ordinary least squares (OLS) estimator $\hat{\beta} = (X^\top X)^{-1} X^\top Y$, where $Y$ is the $n \times 1$ response vector; this form previews the efficient algebraic solution without requiring iterative methods, with full details on its derivation provided in the section on ordinary least squares estimation.

A central element in this representation is the projection matrix $H = X (X^\top X)^{-1} X^\top$, often termed the hat matrix, which orthogonally projects the response vector $Y$ onto the column space of $X$ to yield the fitted values $\hat{Y} = H Y$. This matrix $H$ is symmetric ($H^\top = H$) and idempotent ($H^2 = H$), properties that reflect its role as an orthogonal projection operator and facilitate analytical manipulations such as variance computations. Complementing $H$ is the residual maker matrix $M = I_n - H$, where $I_n$ is the $n \times n$ identity matrix, which produces the residuals $e = M Y$ by projecting $Y$ onto the orthogonal complement of the column space of $X$. The matrix $M$ annihilates $X$ such that $M X = 0$, ensuring residuals are uncorrelated with the predictors in the column space, and it is also symmetric ($M^\top = M$) and idempotent ($M^2 = M$). These properties underscore $M$'s utility in isolating deviations unexplained by the model.

The matrix formulation offers substantial computational advantages, particularly for large datasets, as it leverages optimized linear algebra algorithms in statistical software to perform operations like matrix inversion and multiplication efficiently, scaling to high-dimensional problems where scalar-based approaches would be prohibitive.
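
The projection and residual-maker identities above can be checked numerically; a minimal sketch with a hypothetical design matrix (explicit inversion is used here only to mirror the formulas, not as a recommended computation):

    import numpy as np

    # Hypothetical design matrix: intercept column plus two random predictors
    rng = np.random.default_rng(3)
    n = 50
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.2, size=n)

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y      # OLS estimate (X'X)^{-1} X'y
    H = X @ XtX_inv @ X.T             # hat (projection) matrix
    M = np.eye(n) - H                 # residual maker matrix

    print(np.allclose(H @ H, H))             # idempotent: H^2 = H
    print(np.allclose(M @ X, 0))             # M annihilates X: MX = 0
    print(np.allclose(H @ y, X @ beta_hat))  # fitted values: HY = X beta_hat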

Assumptions and Diagnostics

Core Assumptions

The validity of the linear model, particularly in the context of ordinary least squares (OLS) estimation and inference, hinges on several foundational assumptions that ensure the model's parameters are unbiased, consistent, and efficient. These assumptions pertain to the relationship between the response variable $Y$ and predictors $X$, as well as the properties of the error terms $\epsilon$. Violations can lead to biased estimates or invalid inference, though some robustness holds under large samples via the central limit theorem (CLT). The core assumptions are linearity, independence, homoscedasticity, normality, no perfect multicollinearity, and exogeneity.

Linearity requires that the conditional expectation of the response variable is a linear function of the predictors, expressed as $E(Y \mid X) = X\beta$, where $\beta$ is the vector of parameters. This assumption implies that the effects of the predictors on the mean response are additive and that the relationship is straight-line in the parameters, holding the other variables fixed. It does not necessitate a linear relationship in the variables but rather in the parameters; nonlinearities can be addressed through transformations or additional terms, but the core model form must satisfy this for OLS to yield unbiased estimates.

Independence assumes that the error terms $\epsilon_i$ for different observations are statistically independent, meaning no correlation between residuals across observations, such as in time series where autocorrelation might occur. This ensures that the variance-covariance matrix of the errors is diagonal, supporting the unbiasedness of OLS estimators under the Gauss-Markov theorem. Independence is crucial for the standard errors and hypothesis tests to be valid, as dependence can inflate or deflate them.

Homoscedasticity stipulates that the variance of the errors is constant across all levels of the predictors, i.e., $\operatorname{Var}(\epsilon_i \mid X) = \sigma^2$ for some constant $\sigma^2$. This equal spread of residuals rules out heteroscedasticity, where variance changes with $X$, which could lead to inefficient OLS estimates and unreliable standard errors. Under this assumption, combined with exogeneity, OLS achieves the best linear unbiased estimator (BLUE) property.

Normality posits that the errors follow a normal distribution, $\epsilon_i \sim N(0, \sigma^2)$, which is necessary for exact finite-sample inference, such as t-tests and F-tests on coefficients. However, this assumption is not required for consistency or unbiasedness of OLS; for large samples, the CLT ensures asymptotic normality of the estimators, making inference approximately valid even without it. Normality primarily affects the distribution of test statistics in small samples.

No perfect multicollinearity requires that the predictors are not linearly dependent, meaning the design matrix $X$ has full column rank, so no predictor is an exact linear combination of others. This ensures the matrix $(X^\top X)^{-1}$ exists and is unique, preventing infinite or undefined OLS estimates. While mild multicollinearity is tolerable, perfect collinearity renders coefficients non-identifiable.

Exogeneity, or the zero conditional mean assumption, states that the errors are uncorrelated with the predictors, $E(\epsilon_i \mid X) = 0$, implying no omitted variables or endogeneity biasing the estimates. This strict exogeneity ensures OLS estimators are unbiased and consistent, as any correlation would violate the condition essential for projection-based estimation.
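
A small Monte Carlo sketch (hypothetical simulated setup) illustrating the unbiasedness claim: when the errors are independent, homoscedastic, and mean-zero, the OLS estimates average out to the true coefficients across repeated samples:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    true_beta = np.array([1.0, 2.0])
    X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design across replications

    estimates = []
    for _ in range(2000):
        y = X @ true_beta + rng.normal(scale=1.0, size=n)  # i.i.d. N(0, 1) errors
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates.append(beta_hat)

    print(np.mean(estimates, axis=0))  # close to [1.0, 2.0]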

Violation Detection and Remedies

Detecting violations of the linear model's assumptions is essential for ensuring the validity of inferences drawn from the analysis. Post-estimation diagnostics primarily focus on the residuals, which represent the differences between observed and predicted values. These tools help identify departures from linearity, independence, homoscedasticity, and normality, allowing practitioners to assess model adequacy before proceeding to remedies. To check for linearity, plots of residuals against fitted values are commonly used; a random scatter around zero indicates the assumption holds, while patterns such as curves or funnels suggest nonlinearity. For independence, particularly in time series contexts, the Durbin-Watson test detects first-order autocorrelation by computing a statistic that compares adjacent residuals; values near 2 indicate no autocorrelation, while deviations toward 0 or 4 signal positive or negative autocorrelation, respectively. Heteroscedasticity is assessed via the Breusch-Pagan test, which regresses squared residuals on the predictors and tests the significance of the resulting coefficients under a chi-squared distribution; a significant result rejects constant variance. Normality of residuals is evaluated using quantile-quantile (Q-Q) plots, where points aligning closely with a straight line support the assumption, and deviations indicate skewness or heavy tails.

Multicollinearity among predictors can inflate variance estimates and destabilize coefficient estimates, even if the other assumptions hold. The variance inflation factor (VIF) measures this by quantifying how much the variance of a regression coefficient is increased due to collinearity with other predictors; for each predictor, VIF is computed as 1 over (1 - R²) from regressing it on the others, with values exceeding 10 signaling severe multicollinearity requiring attention.

Outliers and influential points can disproportionately affect model fit and parameter estimates. Cook's distance identifies influential observations by measuring the change in fitted values when a single data point is removed; values greater than 4/n (where n is the sample size) or exceeding an F-threshold flag potential issues, combining leverage and residual magnitude. Leverage plots, based on hat values from the hat matrix, highlight high-leverage points whose predictor values lie far from the centroid of the predictors.

Once violations are detected, targeted remedies can restore assumption validity without discarding the linear framework. For heteroscedasticity, logarithmic or Box-Cox transformations stabilize variance by adjusting the scale of the response or predictors; the Box-Cox family, parameterized by λ, applies (y^λ − 1)/λ for λ ≠ 0 or log(y) for λ = 0, with maximum likelihood estimating the optimal λ to achieve homoscedasticity. Robust standard errors, such as the heteroscedasticity-consistent estimator proposed by White, adjust inference by estimating the covariance matrix in a way that accounts for heteroscedasticity, providing consistent standard errors without altering the coefficients. Autocorrelation, often present in temporal data, can be addressed by including lagged dependent variables as predictors, effectively modeling the serial dependence and reducing residual correlation. For multicollinearity, ridge regression adds a penalty term (λ times the sum of squared coefficients) to the least-squares objective, shrinking estimates toward zero and stabilizing them in correlated predictor spaces; λ is tuned via cross-validation or ridge traces. Outliers may be handled by removing influential points identified via Cook's distance if they are verifiable errors, or by robust regression methods that downweight them, though sensitivity analyses are recommended to confirm robustness. Transformations like those in the Box-Cox framework can also mitigate multiple violations simultaneously by improving overall distributional properties.
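
A diagnostic sketch, assuming the statsmodels library and hypothetical simulated data, showing how the Durbin-Watson statistic, Breusch-Pagan test, variance inflation factors, and Cook's distances described above can be computed:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan
    from statsmodels.stats.stattools import durbin_watson
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Hypothetical data and fitted OLS model
    rng = np.random.default_rng(5)
    X = sm.add_constant(rng.normal(size=(200, 2)))
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=200)
    results = sm.OLS(y, X).fit()

    # Independence: values near 2 suggest no first-order autocorrelation
    print(durbin_watson(results.resid))

    # Homoscedasticity: a small p-value rejects constant variance
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, X)
    print(lm_pvalue)

    # Multicollinearity: VIF for each non-intercept column (values > 10 are a warning)
    print([variance_inflation_factor(X, j) for j in range(1, X.shape[1])])

    # Influence: Cook's distance for each observation
    cooks_d, _ = results.get_influence().cooks_distance
    print(cooks_d.max())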

Estimation and Inference

Ordinary Least Squares Estimation

The ordinary least squares (OLS) estimation method seeks the parameter vector $\hat{\beta}$ that minimizes the sum of squared residuals, given by $\sum_{i=1}^n (y_i - \mathbf{x}_i' \hat{\beta})^2$, where $y_i$ is the observed response and $\mathbf{x}_i' \hat{\beta}$ is the predicted value. This objective function measures the total deviation between observed and fitted values, weighted equally by the square of each residual so that larger errors are penalized more heavily. To derive the OLS estimator, differentiate the sum of squared residuals with respect to $\beta$ and set the result to zero, yielding the normal equations $X'X\beta = X'y$, where $X$ is the design matrix and $y$ is the response vector. Assuming $X'X$ is invertible (which requires no perfect multicollinearity), the solution is $\hat{\beta} = (X'X)^{-1} X'y$. In geometric terms, as detailed in the Mathematical Formulation section, this projection of $y$ onto the column space of $X$ ensures the residuals are orthogonal to the predictors.

Under the assumptions of linearity in parameters, strict exogeneity of regressors, and no perfect multicollinearity, the OLS estimator is unbiased, satisfying $E[\hat{\beta}] = \beta$. Furthermore, by the Gauss-Markov theorem, under the additional assumptions of homoscedastic and uncorrelated errors, $\hat{\beta}$ is the best linear unbiased estimator (BLUE), possessing the minimum variance among all linear unbiased estimators of $\beta$. The variance-covariance matrix of the estimator is $\operatorname{Var}(\hat{\beta}) = \sigma^2 (X'X)^{-1}$, where $\sigma^2$ is the error variance, highlighting how the precision of $\hat{\beta}$ improves with more informative variation in $X$.

OLS estimation is widely implemented in statistical software for practical application. In R, the lm() function from the base stats package fits linear models using OLS by default, accepting a formula interface for specifying predictors and responses. In Python, the statsmodels library provides the OLS class in statsmodels.regression.linear_model, which computes $\hat{\beta}$ and related statistics via methods like fit().
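
A minimal sketch of such a fit in Python with statsmodels, using hypothetical simulated data:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical data for y = 1 + 2*x1 - 0.5*x2 + error
    rng = np.random.default_rng(11)
    X = rng.normal(size=(150, 2))
    y = 1 + 2 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=150)

    # add_constant prepends the column of 1s for the intercept
    X_design = sm.add_constant(X)
    results = sm.OLS(y, X_design).fit()

    print(results.params)        # beta_hat = (X'X)^{-1} X'y
    print(results.cov_params())  # estimated sigma^2 (X'X)^{-1}

An equivalent fit in R would use the formula interface, for example lm(y ~ x1 + x2) on a data frame containing those (hypothetical) columns.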

Hypothesis Testing and Confidence Intervals

In the general linear model, hypothesis testing provides a framework for assessing the statistical significance of the estimated parameters, typically using the ordinary least squares (OLS) estimator. For individual coefficients, the null hypothesis $H_0: \beta_j = 0$ tests whether the $j$th predictor has no linear association with the response variable, assuming the model is otherwise correctly specified. The test statistic is the t-statistic, given by $t = \frac{\hat{\beta}_j}{\operatorname{SE}(\hat{\beta}_j)}$, where $\hat{\beta}_j$ is the OLS estimate and $\operatorname{SE}(\hat{\beta}_j)$ is its standard error, derived from the variance-covariance matrix of the estimator under the model's assumptions. This standard error is $\operatorname{SE}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 (X^\top X)^{-1}_{jj}}$, where $\hat{\sigma}^2$ is the estimated error variance. Under the normality assumption, the t-statistic follows a t-distribution with $n - p$ degrees of freedom when $H_0$ holds, and a $100(1-\alpha)\%$ confidence interval for $\beta_j$ takes the form $\hat{\beta}_j \pm t_{n-p,\,1-\alpha/2}\,\operatorname{SE}(\hat{\beta}_j)$.
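
A brief sketch, assuming statsmodels and hypothetical simulated data, of how these t-statistics, p-values, and confidence intervals are obtained in practice:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    X = sm.add_constant(rng.normal(size=(100, 2)))
    y = X @ np.array([0.5, 1.0, 0.0]) + rng.normal(size=100)
    results = sm.OLS(y, X).fit()

    print(results.tvalues)               # beta_hat_j / SE(beta_hat_j)
    print(results.pvalues)               # two-sided p-values for H0: beta_j = 0
    print(results.conf_int(alpha=0.05))  # 95% confidence intervals for each beta_j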