Respect all members: no insults, harassment, or hate speech.
Be tolerant of different viewpoints, cultures, and beliefs. If you do not agree with others, simply create a separate note, article, or collection.
Clearly distinguish between personal opinion and fact.
Verify facts before posting, especially when writing about history, science, or statistics.
Promotional content must be published on the “Related Services and Products” page—no more than one paragraph per service. You can also create subpages under the “Related Services and Products” page and publish longer promotional text there.
Do not post materials that infringe on copyright without permission.
Always credit sources when sharing information, quotes, or media.
Be respectful of the work of others when making changes.
Discuss major edits instead of removing others' contributions without reason.
If you notice rule-breaking, notify the community about it on the talk pages.
Do not share personal data of others without their consent.
Linear regression is also a type of machine learning algorithm, more specifically a supervised algorithm, that learns from labelled datasets and maps the data points to an optimized linear function that can be used for prediction on new data.[3]
Linear regression was the first type of regression analysis to be studied rigorously, and to be used extensively in practical applications.[4] This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters and because the statistical properties of the resulting estimators are easier to determine.
Linear regression has many practical uses. Most applications fall into one of the following two broad categories:
If the goal is error reduction (i.e., variance reduction) in prediction or forecasting, linear regression can be used to fit a predictive model to an observed data set of values of the response and explanatory variables. After developing such a model, if additional values of the explanatory variables are collected without an accompanying response value, the fitted model can be used to make a prediction of the response.
If the goal is to explain variation in the response variable that can be attributed to variation in the explanatory variables, linear regression analysis can be applied to quantify the strength of the relationship between the response and the explanatory variables, and in particular to determine whether some explanatory variables may have no linear relationship with the response at all, or to identify which subsets of explanatory variables may contain redundant information about the response.
Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the "lack of fit" in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares cost function as in ridge regression (L2-norm penalty) and lasso (L1-norm penalty). Use of the mean squared error (MSE) as the cost on a dataset that has many large outliers can result in a model that fits the outliers more than the true data, due to the higher importance assigned by MSE to large errors. So, cost functions that are robust to outliers should be used if the dataset has many large outliers. Conversely, the least squares approach can be used to fit models that are not linear models. Thus, although the terms "least squares" and "linear model" are closely linked, they are not synonymous.
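As a brief illustration of the robustness point above, the following sketch compares an ordinary least squares fit with a least absolute deviations fit on synthetic data containing a few large outliers. It assumes numpy and scipy are available; the data, seed, and variable names are illustrative only.

```python
# Sketch: compare a squared-error (OLS) fit with a least-absolute-deviations fit
# on data containing a few large outliers. Synthetic data, purely illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=x.size)
y[::12] += 30                               # inject a few large outliers

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept

# OLS: closed-form least squares solution
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# LAD: minimize the sum of absolute residuals numerically
def lad_loss(beta):
    return np.abs(y - X @ beta).sum()

beta_lad = minimize(lad_loss, x0=beta_ols, method="Nelder-Mead").x

print("OLS intercept/slope:", beta_ols)     # pulled toward the outliers
print("LAD intercept/slope:", beta_lad)     # closer to the underlying (2.0, 1.5)
```

On such data the squared-error fit is dragged toward the outliers, while the absolute-error fit stays close to the underlying trend.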
In linear regression, the observations (red) are assumed to be the result of random deviations (green) from an underlying relationship (blue) between a dependent variable (y) and an independent variable (x).
Given a data set {yi, xi1, …, xip}, i = 1, …, n, of n statistical units, a linear regression model assumes that the relationship between the dependent variable y and the vector of regressors x is linear. This relationship is modeled through a disturbance term or error variable ε, an unobserved random variable that adds "noise" to the linear relationship between the dependent variable and regressors. Thus the model takes the form

yi = β0 + β1xi1 + ⋯ + βpxip + εi = xiTβ + εi,  i = 1, …, n,

where T denotes the transpose, so that xiTβ is the inner product between the vectors xi and β.
Often these n equations are stacked together and written in matrix notation as

y = Xβ + ε,

where y is a vector of observed values yi of the variable called the regressand, endogenous variable, response variable, target variable, measured variable, criterion variable, or dependent variable. This variable is also sometimes known as the predicted variable, but this should not be confused with predicted values, which are denoted ŷ. The decision as to which variable in a data set is modeled as the dependent variable and which are modeled as the independent variables may be based on a presumption that the value of one of the variables is caused by, or directly influenced by the other variables. Alternatively, there may be an operational reason to model one of the variables in terms of the others, in which case there need be no presumption of causality.
Usually a constant is included as one of the regressors. In particular, xi0 = 1 for i = 1, …, n. The corresponding element of β is called the intercept. Many statistical inference procedures for linear models require an intercept to be present, so it is often included even if theoretical considerations suggest that its value should be zero.
Sometimes one of the regressors can be a non-linear function of another regressor or of the data values, as in polynomial regression and segmented regression. The model remains linear as long as it is linear in the parameter vector β.
The values xij may be viewed as either observed values of random variables Xj or as fixed values chosen prior to observing the dependent variable. Both interpretations may be appropriate in different cases, and they generally lead to the same estimation procedures; however, different approaches to asymptotic analysis are used in these two situations.
β is a (p + 1)-dimensional parameter vector, where β0 is the intercept term (if one is included in the model; otherwise β is p-dimensional). Its elements are known as effects or regression coefficients (although the latter term is sometimes reserved for the estimated effects). In simple linear regression, p = 1, and the coefficient is known as the regression slope. Statistical estimation and inference in linear regression focuses on β. The elements of this parameter vector are interpreted as the partial derivatives of the dependent variable with respect to the various independent variables.
ε is a vector of values εi. This part of the model is called the error term, disturbance term, or sometimes noise (in contrast with the "signal" provided by the rest of the model). This variable captures all other factors which influence the dependent variable y other than the regressors x. The relationship between the error term and the regressors, for example their correlation, is a crucial consideration in formulating a linear regression model, as it will determine the appropriate estimation method.
Fitting a linear model to a given data set usually requires estimating the regression coefficients such that the error term ε = y − Xβ is minimized. For example, it is common to use the sum of squared errors ‖ε‖² as a measure of ε for minimization.
Consider a situation where a small ball is being tossed up in the air and then we measure its heights of ascent hi at various moments in time ti. Physics tells us that, ignoring the drag, the relationship can be modeled as

hi = β1ti + β2ti² + εi,

where β1 determines the initial velocity of the ball, β2 is proportional to the standard gravity, and εi is due to measurement errors. Linear regression can be used to estimate the values of β1 and β2 from the measured data. This model is non-linear in the time variable, but it is linear in the parameters β1 and β2; if we take regressors xi = (xi1, xi2) = (ti, ti²), the model takes on the standard form

hi = xiTβ + εi.
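As a sketch of this example, the code below simulates noisy height measurements and recovers β1 and β2 by least squares on the regressors (t, t²). It assumes numpy is available; the "true" velocity, gravity term, and noise level are illustrative choices.

```python
# Sketch: fit the ball-toss model h_i = beta1*t_i + beta2*t_i^2 + eps_i by least
# squares. The regressors (t, t^2) are nonlinear in time, but the model is linear
# in beta1 and beta2. Synthetic data; assumes numpy is available.
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.1, 2.0, 20)
true_v0, true_halfg = 10.0, -4.905            # beta1 = initial velocity, beta2 ~ -g/2
h = true_v0 * t + true_halfg * t**2 + rng.normal(0, 0.05, t.size)  # measurement error

X = np.column_stack([t, t**2])                # x_i = (t_i, t_i^2), no intercept term
beta_hat, *_ = np.linalg.lstsq(X, h, rcond=None)
print("estimated beta1 (initial velocity):", beta_hat[0])
print("estimated beta2 (~ -g/2):          ", beta_hat[1])
```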
Standard linear regression models with standard estimation techniques make a number of assumptions about the predictor variables, the response variable and their relationship. Numerous extensions have been developed that allow each of these assumptions to be relaxed (i.e. reduced to a weaker form), and in some cases eliminated entirely. Generally these extensions make the estimation procedure more complex and time-consuming, and may also require more data in order to produce an equally precise model.[citation needed]
Example of a cubic polynomial regression, which is a type of linear regression. Although polynomial regression fits a curved model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.
The following are the major assumptions made by standard linear regression models with standard estimation techniques (e.g. ordinary least squares):
Weak exogeneity. This essentially means that the predictor variables x can be treated as fixed values, rather than random variables. This means, for example, that the predictor variables are assumed to be error-free—that is, not contaminated with measurement errors. Although this assumption is not realistic in many settings, dropping it leads to significantly more difficult errors-in-variables models.
Linearity. This means that the mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables. Note that this assumption is much less restrictive than it may at first seem. Because the predictor variables are treated as fixed values (see above), linearity is really only a restriction on the parameters. The predictor variables themselves can be arbitrarily transformed, and in fact multiple copies of the same underlying predictor variable can be added, each one transformed differently. This technique is used, for example, in polynomial regression, which uses linear regression to fit the response variable as an arbitrary polynomial function (up to a given degree) of a predictor variable. With this much flexibility, models such as polynomial regression often have "too much power", in that they tend to overfit the data. As a result, some kind of regularization must typically be used to prevent unreasonable solutions coming out of the estimation process. Common examples are ridge regression and lasso regression. Bayesian linear regression can also be used, which by its nature is more or less immune to the problem of overfitting. (In fact, ridge regression and lasso regression can both be viewed as special cases of Bayesian linear regression, with particular types of prior distributions placed on the regression coefficients.)
Visualization of heteroscedasticity in a scatter plot against 100 random fitted values.

Constant variance (a.k.a. homoscedasticity). This means that the variance of the errors does not depend on the values of the predictor variables. Thus the variability of the responses for given fixed values of the predictors is the same regardless of how large or small the responses are. This is often not the case, as a variable whose mean is large will typically have a greater variance than one whose mean is small. For example, a person whose income is predicted to be $100,000 may easily have an actual income of $80,000 or $120,000—i.e., a standard deviation of around $20,000—while another person with a predicted income of $10,000 is unlikely to have the same $20,000 standard deviation, since that would imply their actual income could vary anywhere between −$10,000 and $30,000. (In fact, as this shows, in many cases—often the same cases where the assumption of normally distributed errors fails—the variance or standard deviation should be predicted to be proportional to the mean, rather than constant.) The absence of homoscedasticity is called heteroscedasticity. In order to check this assumption, a plot of residuals versus predicted values (or the values of each individual predictor) can be examined for a "fanning effect" (i.e., increasing or decreasing vertical spread as one moves left to right on the plot). A plot of the absolute or squared residuals versus the predicted values (or each predictor) can also be examined for a trend or curvature. Formal tests can also be used; see Heteroscedasticity. The presence of heteroscedasticity will result in an overall "average" estimate of variance being used instead of one that takes into account the true variance structure. This leads to less precise (but in the case of ordinary least squares, not biased) parameter estimates and biased standard errors, resulting in misleading tests and interval estimates. The mean squared error for the model will also be wrong. Various estimation techniques including weighted least squares and the use of heteroscedasticity-consistent standard errors can handle heteroscedasticity in a quite general way. Bayesian linear regression techniques can also be used when the variance is assumed to be a function of the mean. It is also possible in some cases to fix the problem by applying a transformation to the response variable (e.g., fitting the logarithm of the response variable using a linear regression model, which implies that the response variable itself has a log-normal distribution rather than a normal distribution).
To check for violations of the assumptions of linearity, constant variance, and independence of errors within a linear regression model, the residuals are typically plotted against the predicted values (or each of the individual predictors). An apparently random scatter of points about the horizontal midline at 0 is ideal, but cannot rule out certain kinds of violations such as autocorrelation in the errors or their correlation with one or more covariates.
Independence of errors. This assumes that the errors of the response variables are uncorrelated with each other. (Actual statistical independence is a stronger condition than mere lack of correlation and is often not needed, although it can be exploited if it is known to hold.) Some methods such as generalized least squares are capable of handling correlated errors, although they typically require significantly more data unless some sort of regularization is used to bias the model towards assuming uncorrelated errors. Bayesian linear regression is a general way of handling this issue.
Lack of perfect multicollinearity in the predictors. For standard least squares estimation methods, the design matrix X must have full column rank p; otherwise perfect multicollinearity exists in the predictor variables, meaning a linear relationship exists between two or more predictor variables. This can be caused by accidentally duplicating a variable in the data, using a linear transformation of a variable along with the original (e.g., the same temperature measurements expressed in Fahrenheit and Celsius), or including a linear combination of multiple variables in the model, such as their mean. It can also happen if there is too little data available compared to the number of parameters to be estimated (e.g., fewer data points than regression coefficients). Near violations of this assumption, where predictors are highly but not perfectly correlated, can reduce the precision of parameter estimates (see Variance inflation factor). In the case of perfect multicollinearity, the parameter vector β will be non-identifiable—it has no unique solution. In such a case, only some of the parameters can be identified (i.e., their values can only be estimated within some linear subspace of the full parameter space R^p). See partial least squares regression. Methods for fitting linear models with multicollinearity have been developed,[5][6][7][8] some of which require additional assumptions such as "effect sparsity"—that a large fraction of the effects are exactly zero. Note that the more computationally expensive iterated algorithms for parameter estimation, such as those used in generalized linear models, do not suffer from this problem.
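A quick way to detect such a rank deficiency numerically is to inspect the column rank and condition number of the design matrix, as in the sketch below. It assumes numpy is available; the Fahrenheit/Celsius columns mirror the duplication described above, and the data are simulated.

```python
# Sketch: detecting (near-)perfect multicollinearity by checking the column rank
# and condition number of the design matrix. Assumes numpy; data are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n = 100
temp_c = rng.normal(20, 5, n)
temp_f = temp_c * 9 / 5 + 32          # exact linear transform of temp_c

X = np.column_stack([np.ones(n), temp_c, temp_f])
print("columns:", X.shape[1], " rank:", np.linalg.matrix_rank(X))  # rank < columns
print("condition number:", np.linalg.cond(X))  # huge: X'X is (numerically) singular
```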
Violations of these assumptions can result in biased estimates of β, biased standard errors, and untrustworthy confidence intervals and significance tests. Beyond these assumptions, several other statistical properties of the data strongly influence the performance of different estimation methods:
The statistical relationship between the error terms and the regressors plays an important role in determining whether an estimation procedure has desirable sampling properties such as being unbiased and consistent.
The arrangement, or probability distribution of the predictor variables x has a major influence on the precision of estimates of β. Sampling and design of experiments are highly developed subfields of statistics that provide guidance for collecting data in such a way to achieve a precise estimate of β.
The data sets in the Anscombe's quartet are designed to have approximately the same linear regression line (as well as nearly identical means, standard deviations, and correlations) but are graphically very different. This illustrates the pitfalls of relying solely on a fitted model to understand the relationship between variables.
A fitted linear regression model can be used to identify the relationship between a single predictor variable xj and the response variable y when all the other predictor variables in the model are "held fixed". Specifically, the interpretation of βj is the expected change in y for a one-unit change in xj when the other covariates are held fixed—that is, the expected value of the partial derivative of y with respect to xj. This is sometimes called the unique effect of xj on y. In contrast, the marginal effect of xj on y can be assessed using a correlation coefficient or simple linear regression model relating only xj to y; this effect is the total derivative of y with respect to xj.
Care must be taken when interpreting regression results, as some of the regressors may not allow for marginal changes (such as dummy variables, or the intercept term), while others cannot be held fixed (recall the example from the introduction: it would be impossible to "hold ti fixed" and at the same time change the value of ti2).
It is possible for the unique effect to be nearly zero even when the marginal effect is large. This may imply that some other covariate captures all the information in xj, so that once that variable is in the model, there is no contribution of xj to the variation in y. Conversely, the unique effect of xj can be large while its marginal effect is nearly zero. This would happen if the other covariates explained a great deal of the variation of y, but they mainly explain variation in a way that is complementary to what is captured by xj. In this case, including the other variables in the model reduces the part of the variability of y that is unrelated to xj, thereby strengthening the apparent relationship with xj.
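The distinction between marginal and unique effects can be demonstrated with a short simulation, shown below. It assumes numpy is available; the correlation structure, coefficients, and seed are illustrative choices, not taken from the text.

```python
# Sketch: the marginal effect of x1 (simple regression) versus its unique effect
# (multiple regression holding x2 fixed) when x1 and x2 are correlated.
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x2 = rng.normal(size=n)
x1 = 0.9 * x2 + 0.1 * rng.normal(size=n)      # x1 strongly correlated with x2
y = 2.0 * x2 + rng.normal(size=n)             # y depends only on x2

# Marginal effect: regress y on x1 alone (picks up x2's influence through x1)
X_marg = np.column_stack([np.ones(n), x1])
b_marg, *_ = np.linalg.lstsq(X_marg, y, rcond=None)

# Unique effect: regress y on x1 and x2 together (x1's coefficient is near zero)
X_full = np.column_stack([np.ones(n), x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

print("marginal effect of x1:", b_marg[1])    # large, borrowed from x2
print("unique effect of x1:  ", b_full[1])    # close to zero
```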
The meaning of the expression "held fixed" may depend on how the values of the predictor variables arise. If the experimenter directly sets the values of the predictor variables according to a study design, the comparisons of interest may literally correspond to comparisons among units whose predictor variables have been "held fixed" by the experimenter. Alternatively, the expression "held fixed" can refer to a selection that takes place in the context of data analysis. In this case, we "hold a variable fixed" by restricting our attention to the subsets of the data that happen to have a common value for the given predictor variable. This is the only interpretation of "held fixed" that can be used in an observational study.
The notion of a "unique effect" is appealing when studying a complex system where multiple interrelated components influence the response variable. In some cases, it can literally be interpreted as the causal effect of an intervention that is linked to the value of a predictor variable. However, it has been argued that in many cases multiple regression analysis fails to clarify the relationships between the predictor variables and the response variable when the predictors are correlated with each other and are not assigned following a study design.[9]
The simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression (not to be confused with multivariate linear regression).[10]
Multiple linear regression is a generalization of simple linear regression to the case of more than one independent variable, and a special case of general linear models, restricted to one dependent variable. The basic model for multiple linear regression is

Yi = β0 + β1Xi1 + β2Xi2 + ⋯ + βpXip + εi

for each observation i = 1, …, n.
In the formula above we consider n observations of one dependent variable and p independent variables. Thus, Yi is the ith observation of the dependent variable, and Xij is the ith observation of the jth independent variable, j = 1, 2, …, p. The values βj represent parameters to be estimated, and εi is the ith independent, identically distributed normal error.
In the more general multivariate linear regression, there is one equation of the above form for each of m > 1 dependent variables that share the same set of explanatory variables and hence are estimated simultaneously with each other:

Yij = β0j + β1jXi1 + β2jXi2 + ⋯ + βpjXip + εij
for all observations indexed as i = 1, ... , n and for all dependent variables indexed as j = 1, ... , m.
Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term, multivariate linear regression, refers to cases where y is a vector, i.e., the same as general linear regression.
Model Assumptions to Check:
1. Linearity: Relationship between each predictor and outcome must be linear
2. Normality of residuals: Residuals should follow a normal distribution
3. Homoscedasticity: Constant variance of residuals across predicted values
4. Independence: Observations should be independent (not repeated measures)
SPSS: Use partial plots, histograms, P-P plots, and residual vs. predicted plots; a Python sketch of equivalent checks is given below.
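For readers working outside SPSS, a minimal Python sketch of the corresponding graphical checks might look as follows. It assumes numpy, scipy, and matplotlib are available; the simulated data and variable names are illustrative.

```python
# Sketch: residual-versus-fitted plot (linearity, homoscedasticity) and a normal
# Q-Q plot of the residuals (normality). Synthetic data for illustration only.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
y = 1.0 + 0.5 * x + rng.normal(0, 1, x.size)

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
axes[0].scatter(fitted, resid, s=10)
axes[0].axhline(0, color="grey")
axes[0].set(xlabel="fitted values", ylabel="residuals", title="Residuals vs fitted")
stats.probplot(resid, dist="norm", plot=axes[1])   # normal Q-Q plot
plt.tight_layout()
plt.show()
```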
The general linear model considers the situation when the response variable is not a scalar (for each observation) but a vector, yi. Conditional linearity of E(y | x) = Bx is still assumed, with a matrix B replacing the vector β of the classical linear regression model. Multivariate analogues of ordinary least squares (OLS) and generalized least squares (GLS) have been developed. "General linear models" are also called "multivariate linear models". These are not the same as multivariable linear models (also called "multiple linear models").
The generalized linear model (GLM) is a framework for modeling response variables that are bounded or discrete. This is used, for example:
when modeling positive quantities (e.g. prices or populations) that vary over a large scale, which are better described using a skewed distribution such as the log-normal distribution or Poisson distribution (although GLMs are not used for log-normal data; instead, the response variable is simply transformed using the logarithm function);
when modeling ordinal data, e.g. ratings on a scale from 0 to 5, where the different outcomes can be ordered but where the quantity itself may not have any absolute meaning (e.g. a rating of 4 may not be "twice as good" in any objective sense as a rating of 2, but simply indicates that it is better than 2 or 3 but not as good as 5).
Generalized linear models allow for an arbitrary link function, g, that relates the mean of the response variable(s) to the predictors: E(Y) = g⁻¹(Xβ). The link function is often related to the distribution of the response, and in particular it typically has the effect of transforming between the range of the linear predictor and the range of the response variable.
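As a hedged illustration of a link function in practice, the sketch below fits a Poisson regression with the default log link, assuming the statsmodels package is available; the simulated data and coefficient values are illustrative.

```python
# Sketch: a generalized linear model with a log link (Poisson regression), relating
# the mean of a count response to a linear predictor. Simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 2, 300)
mu = np.exp(0.3 + 1.2 * x)                 # mean linked to the linear predictor via log
y = rng.poisson(mu)

X = sm.add_constant(x)                     # adds the intercept column
model = sm.GLM(y, X, family=sm.families.Poisson())   # log link is the default
result = model.fit()
print(result.params)                       # estimates of (0.3, 1.2), approximately
```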
Single index models[clarification needed] allow some degree of nonlinearity in the relationship between x and y, while preserving the central role of the linear predictor β′x as in the classical linear regression model. Under certain conditions, simply applying OLS to data from a single-index model will consistently estimate β up to a proportionality constant.[11]
Hierarchical linear models (or multilevel regression) organize the data into a hierarchy of regressions, for example where A is regressed on B, and B is regressed on C. They are often used where the variables of interest have a natural hierarchical structure, such as in educational statistics, where students are nested in classrooms, classrooms are nested in schools, and schools are nested in some administrative grouping, such as a school district. The response variable might be a measure of student achievement such as a test score, and different covariates would be collected at the classroom, school, and school district levels.
Errors-in-variables models (or "measurement error models") extend the traditional linear regression model to allow the predictor variables X to be observed with error. This error causes standard estimators of β to become biased. Generally, the form of bias is an attenuation, meaning that the effects are biased toward zero.
The parameter βj of a predictor variable xj represents the individual effect of xj. It has an interpretation as the expected change in the response variable y when xj increases by one unit with the other predictor variables held constant. When xj is strongly correlated with other predictor variables, it is improbable that xj can increase by one unit with the other variables held constant. In this case, the interpretation of βj becomes problematic, as it is based on an improbable condition, and the effect of xj cannot be evaluated in isolation.

For a group of predictor variables, say {x1, x2, …, xq}, a group effect ξ(w) is defined as a linear combination of their parameters,

ξ(w) = w1β1 + w2β2 + ⋯ + wqβq,

where w = (w1, w2, …, wq)ᵀ is a weight vector satisfying |w1| + |w2| + ⋯ + |wq| = 1. Because of the constraint on w, ξ(w) is also referred to as a normalized group effect. A group effect ξ(w) has an interpretation as the expected change in y when the variables in the group x1, x2, …, xq change by the amounts w1, w2, …, wq, respectively, at the same time with variables not in the group held constant. It generalizes the individual effect of a variable to a group of variables in that (i) if q = 1, the group effect reduces to an individual effect, and (ii) if wi = 1 and wj = 0 for j ≠ i, the group effect also reduces to the individual effect of xi.

A group effect ξ(w) is said to be meaningful if the underlying simultaneous changes of the variables represented by w are probable.

Group effects provide a means to study the collective impact of strongly correlated predictor variables in linear regression models. Individual effects of such variables are not well-defined, as their parameters do not have good interpretations. Furthermore, when the sample size is not large, none of their parameters can be accurately estimated by the least squares regression due to the multicollinearity problem. Nevertheless, there are meaningful group effects that have good interpretations and can be accurately estimated by the least squares regression. A simple way to identify these meaningful group effects is to use an all positive correlations (APC) arrangement of the strongly correlated variables, under which pairwise correlations among these variables are all positive, and to standardize all predictor variables in the model so that they all have mean zero and length one. To illustrate this, suppose that {x1, x2, …, xq} is a group of strongly correlated variables in an APC arrangement and that they are not strongly correlated with predictor variables outside the group. Let y′ be the centred y and xj′ be the standardized xj. Then, the standardized linear regression model is

y′ = β1′x1′ + β2′x2′ + ⋯ + βp′xp′ + ε.

Parameters in the original model, including β0, are simple functions of the βj′ in the standardized model. The standardization of variables does not change their correlations, so {x1′, x2′, …, xq′} is a group of strongly correlated variables in an APC arrangement, and they are not strongly correlated with other predictor variables in the standardized model. A group effect of {x1′, x2′, …, xq′} is

ξ′(w) = w1β1′ + w2β2′ + ⋯ + wqβq′,

and its minimum-variance unbiased linear estimator is

ξ̂′(w) = w1β̂1′ + w2β̂2′ + ⋯ + wqβ̂q′,

where β̂j′ is the least squares estimator of βj′. In particular, the average group effect of the q standardized variables is

ξA = (1/q)(β1′ + β2′ + ⋯ + βq′),

which has an interpretation as the expected change in y′ when all xj′ in the strongly correlated group increase by (1/q)th of a unit at the same time with variables outside the group held constant. With strong positive correlations and in standardized units, the variables in the group are approximately equal, so they are likely to increase at the same time and by similar amounts. Thus, the average group effect ξA is a meaningful effect. It can be accurately estimated by its minimum-variance unbiased linear estimator ξ̂A = (1/q)(β̂1′ + β̂2′ + ⋯ + β̂q′), even when individually none of the βj′ can be accurately estimated by β̂j′.

Not all group effects are meaningful or can be accurately estimated. For example, β1′ is a special group effect with weights w1 = 1 and wj = 0 for j ≠ 1, but it cannot be accurately estimated by β̂1′. It is also not a meaningful effect. In general, for a group of q strongly correlated predictor variables in an APC arrangement in the standardized model, group effects whose weight vectors w are at or near the centre of the simplex (wj = 1/q for all j) are meaningful and can be accurately estimated by their minimum-variance unbiased linear estimators. Effects with weight vectors far away from the centre are not meaningful, as such weight vectors represent simultaneous changes of the variables that violate the strong positive correlations of the standardized variables in an APC arrangement. As such, they are not probable. These effects also cannot be accurately estimated.

Applications of the group effects include (1) estimation and inference for meaningful group effects on the response variable, (2) testing for "group significance" of the q variables via testing H0: ξA = 0 versus H1: ξA ≠ 0, and (3) characterizing the region of the predictor variable space over which predictions by the least squares estimated model are accurate.

A group effect of the original variables {x1, x2, …, xq} can be expressed as a constant times a group effect of the standardized variables {x1′, x2′, …, xq′}. The former is meaningful when the latter is. Thus meaningful group effects of the original variables can be found through meaningful group effects of the standardized variables.[12]
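A small simulation can make this contrast concrete. The sketch below is not from the cited paper; it simply generates two strongly positively correlated, standardized predictors (variable names, seed, and sample sizes are illustrative) and shows that the individual least squares coefficients vary wildly across samples while their average, the average group effect, is estimated stably. It assumes numpy is available.

```python
# Sketch: with two strongly positively correlated, standardized predictors, the
# individual least squares coefficients vary wildly across simulated samples,
# while their average (the average group effect) is estimated stably.
import numpy as np

rng = np.random.default_rng(6)

def standardize(v):
    v = v - v.mean()
    return v / np.linalg.norm(v)            # mean zero, length one

b1_est, b2_est, avg_est = [], [], []
for _ in range(500):
    n = 50
    z = rng.normal(size=n)
    x1 = standardize(z + 0.05 * rng.normal(size=n))   # x1, x2 nearly collinear
    x2 = standardize(z + 0.05 * rng.normal(size=n))
    y = 1.0 * x1 + 1.0 * x2 + 0.2 * rng.normal(size=n)
    X = np.column_stack([x1, x2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    b1_est.append(b[0]); b2_est.append(b[1]); avg_est.append(b.mean())

print("std of beta1 estimates:      ", np.std(b1_est))   # large
print("std of beta2 estimates:      ", np.std(b2_est))   # large
print("std of average group effect: ", np.std(avg_est))  # small
```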
In Dempster–Shafer theory, or a linear belief function in particular, a linear regression model may be represented as a partially swept matrix, which can be combined with similar matrices representing observations and other assumed normal distributions and state equations. The combination of swept or unswept matrices provides an alternative method for estimating linear regression models.
A large number of procedures have been developed for parameter estimation and inference in linear regression. These methods differ in computational simplicity of algorithms, presence of a closed-form solution, robustness with respect to heavy-tailed distributions, and theoretical assumptions needed to validate desirable statistical properties such as consistency and asymptotic efficiency.
Some of the more common estimation techniques for linear regression are summarized below.
Francis Galton's 1886[13] illustration of the correlation between the heights of adults and their parents. The observation that adult children's heights tended to deviate less from the mean height than their parents suggested the concept of "regression toward the mean", giving regression its name. The "locus of horizontal tangential points" passing through the leftmost and rightmost points on the ellipse (which is a level curve of the bivariate normal distribution estimated from the data) is the OLS estimate of the regression of parents' heights on children's heights, while the "locus of vertical tangential points" is the OLS estimate of the regression of children's heights on parents' heights. The major axis of the ellipse is the TLS estimate.
Assuming that the independent variables for the ith observation are xi = (xi1, …, xim) and the model's parameters are β = (β0, β1, …, βm), the model's prediction is

yi ≈ β0 + β1xi1 + ⋯ + βmxim.

If xi is extended to include a constant term, xi = (1, xi1, …, xim), then the prediction becomes a dot product of the parameter vector and the regressor vector:

yi ≈ β · xi.

In the least-squares setting, the optimum parameter vector β̂ is defined as the one that minimizes the sum of squared losses over the data set D:

β̂ = argminβ L(D, β),  with  L(D, β) = Σi (β · xi − yi)².

Now putting the independent and dependent variables in matrices X and Y respectively, the loss function can be rewritten as

L(D, β) = ‖Xβ − Y‖² = (Xβ − Y)ᵀ(Xβ − Y).

Setting the gradient with respect to β to zero produces the optimum parameter:

∇β L = 2Xᵀ(Xβ − Y) = 0  ⟹  β̂ = (XᵀX)⁻¹XᵀY.

Note: to confirm that the β̂ obtained in this way is indeed a minimum, one needs to differentiate once more to obtain the Hessian matrix 2XᵀX and show that it is positive definite. This is provided by the Gauss–Markov theorem.
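A minimal sketch of this closed-form solution, assuming numpy is available (the simulated data and seed are illustrative), solves the normal equations and checks the result against numpy's built-in least squares routine.

```python
# Sketch: the closed-form least squares solution beta = (X^T X)^{-1} X^T y, checked
# against numpy's built-in solver. Synthetic data.
import numpy as np

rng = np.random.default_rng(7)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + rng.normal(0, 0.5, n)

# Solve the normal equations (X^T X) beta = X^T y; np.linalg.solve avoids forming
# an explicit matrix inverse and is numerically preferable.
beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal_eq)
print(np.allclose(beta_normal_eq, beta_lstsq))   # True
```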
Maximum likelihood estimation can be performed when the distribution of the error terms is known to belong to a certain parametric family ƒθ of probability distributions.[15] When fθ is a normal distribution with zero mean and variance θ, the resulting estimate is identical to the OLS estimate. GLS estimates are maximum likelihood estimates when ε follows a multivariate normal distribution with a known covariance matrix.
Let us denote each data point by (xi, yi), the regression parameters by β, the set of all data by D, and the cost function by L(D, β) = Σi (yi − β · xi)².

As shown below, the same optimal parameter that minimizes L(D, β) also achieves maximum likelihood.[16] Here the assumption is that the dependent variable y is a random variable that follows a Gaussian distribution, where the standard deviation σ is fixed and the mean is a linear combination of x:

p(yi | xi; β) = (1 / √(2πσ²)) exp(−(yi − β · xi)² / (2σ²)),

so the likelihood of the data set D is the product of these densities over the n observations.

Now, we need to look for a parameter β that maximizes this likelihood function. Since the logarithmic function is strictly increasing, instead of maximizing this function, we can also maximize its logarithm and find the optimal parameter that way:[16]

log L(β) = −(n/2) log(2πσ²) − (1 / (2σ²)) Σi (yi − β · xi)².

In this way, the parameter that maximizes the log-likelihood is the same as the one that minimizes L(D, β). This means that in linear regression, the result of the least squares method is the same as the result of the maximum likelihood estimation method.[16]
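The equivalence can be checked numerically. The sketch below, assuming numpy and scipy are available and that σ is fixed and known, minimizes the Gaussian negative log-likelihood and compares the result with the OLS solution; the data are simulated for illustration.

```python
# Sketch: minimizing the Gaussian negative log-likelihood in beta recovers the same
# estimate as ordinary least squares. Synthetic data, fixed known sigma.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 2.0]) + rng.normal(0, 1.0, n)

sigma = 1.0   # fixed, known error standard deviation (assumption of this sketch)

def neg_log_likelihood(beta):
    resid = y - X @ beta
    # Gaussian negative log-likelihood up to an additive constant
    return 0.5 * np.sum(resid**2) / sigma**2

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("MLE:", beta_mle)
print("OLS:", beta_ols)   # the two estimates coincide up to numerical tolerance
```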
Ridge regression[17][18][19] and other forms of penalized estimation, such as Lasso regression,[5] deliberately introduce bias into the estimation of β in order to reduce the variability of the estimate. The resulting estimates generally have lower mean squared error than the OLS estimates, particularly when multicollinearity is present or when overfitting is a problem. They are generally used when the goal is to predict the value of the response variable y for values of the predictors x that have not yet been observed. These methods are not as commonly used when the goal is inference, since it is difficult to account for the bias.
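A minimal sketch of the ridge idea, assuming numpy is available: the closed-form estimate (XᵀX + λI)⁻¹Xᵀy is compared with OLS on two correlated predictors; the penalty value and data are illustrative.

```python
# Sketch: ridge regression's closed-form estimate beta = (X'X + lambda*I)^{-1} X'y,
# compared with OLS on correlated predictors. Illustrative penalty and data.
import numpy as np

rng = np.random.default_rng(9)
n = 100
z = rng.normal(size=n)
X = np.column_stack([z + 0.1 * rng.normal(size=n),
                     z + 0.1 * rng.normal(size=n)])   # two correlated predictors
y = X @ np.array([1.0, 1.0]) + rng.normal(0, 1, n)

lam = 1.0
p = X.shape[1]
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("OLS  :", beta_ols)     # unstable under multicollinearity
print("ridge:", beta_ridge)   # shrunk toward zero, lower variance
```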
Least absolute deviation (LAD) regression is a robust estimation technique in that it is less sensitive to the presence of outliers than OLS (but is less efficient than OLS when no outliers are present). It is equivalent to maximum likelihood estimation under a Laplace distribution model for ε.[20]
If we assume that the error terms are independent of the regressors, εi ⊥ xi, then the optimal estimator is the 2-step MLE, where the first step is used to non-parametrically estimate the distribution of the error term.[21]
Bayesian linear regression applies the framework of Bayesian statistics to linear regression. (See also Bayesian multivariate linear regression.) In particular, the regression coefficients β are assumed to be random variables with a specified prior distribution. The prior distribution can bias the solutions for the regression coefficients, in a way similar to (but more general than) ridge regression or lasso regression. In addition, the Bayesian estimation process produces not a single point estimate for the "best" values of the regression coefficients but an entire posterior distribution, completely describing the uncertainty surrounding the quantity. This can be used to estimate the "best" coefficients using the mean, mode, median, any quantile (see quantile regression), or any other function of the posterior distribution.
Quantile regression focuses on the conditional quantiles of y given X rather than the conditional mean of y given X. Linear quantile regression models a particular conditional quantile, for example the conditional median, as a linear function βTx of the predictors.
Mixed models are widely used to analyze linear regression relationships involving dependent data when the dependencies have a known structure. Common applications of mixed models include analysis of data involving repeated measurements, such as longitudinal data, or data obtained from cluster sampling. They are generally fit as parametric models, using maximum likelihood or Bayesian estimation. In the case where the errors are modeled as normal random variables, there is a close connection between mixed models and generalized least squares.[22] Fixed effects estimation is an alternative approach to analyzing this type of data.
Principal component regression (PCR)[7][8] is used when the number of predictor variables is large, or when strong correlations exist among the predictor variables. This two-stage procedure first reduces the predictor variables using principal component analysis, and then uses the reduced variables in an OLS regression fit. While it often works well in practice, there is no general theoretical reason that the most informative linear function of the predictor variables should lie among the dominant principal components of the multivariate distribution of the predictor variables. Partial least squares regression is an extension of the PCR method that does not suffer from this deficiency.
Least-angle regression[6] is an estimation procedure for linear regression models that was developed to handle high-dimensional covariate vectors, potentially with more covariates than observations.
The Theil–Sen estimator is a simple robust estimation technique that chooses the slope of the fit line to be the median of the slopes of the lines through pairs of sample points. It has similar statistical efficiency properties to simple linear regression but is much less sensitive to outliers.[23]
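A plain, unoptimized sketch of the Theil–Sen idea, assuming numpy is available: the slope is the median of the pairwise slopes and the intercept the median of y − slope·x; the data and the injected outlier are illustrative.

```python
# Sketch: the Theil-Sen slope as the median of slopes over all pairs of points,
# with the intercept taken as the median of y - slope*x. O(n^2) illustration only.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(10)
x = np.arange(30, dtype=float)
y = 3.0 + 0.8 * x + rng.normal(0, 1, x.size)
y[5] += 40                                  # a gross outlier

slopes = [(y[j] - y[i]) / (x[j] - x[i])
          for i, j in combinations(range(len(x)), 2) if x[j] != x[i]]
slope = np.median(slopes)
intercept = np.median(y - slope * x)
print("Theil-Sen slope, intercept:", slope, intercept)   # close to (0.8, 3.0)
```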
Other robust estimation techniques, including the α-trimmed mean approach, and L-, M-, S-, and R-estimators have been introduced.
Linear regression is widely used in biological, behavioral and social sciences to describe possible relationships between variables. It ranks as one of the most important tools used in these disciplines.
A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. It tells whether a particular data set (say GDP, oil prices or stock prices) has increased or decreased over a period of time. A trend line could simply be drawn by eye through a set of data points, but more properly its position and slope are calculated using statistical techniques like linear regression. Trend lines typically are straight lines, although some variations use higher-degree polynomials depending on the degree of curvature desired in the line.
Trend lines are sometimes used in business analytics to show changes in data over time. This has the advantage of being simple. Trend lines are often used to argue that a particular action or event (such as training, or an advertising campaign) caused observed changes at a point in time. This is a simple technique, and does not require a control group, experimental design, or a sophisticated analysis technique. However, it suffers from a lack of scientific validity in cases where other potential changes can affect the data.
Early evidence relating tobacco smoking to mortality and morbidity came from observational studies employing regression analysis. In order to reduce spurious correlations when analyzing observational data, researchers usually include several variables in their regression models in addition to the variable of primary interest. For example, in a regression model in which cigarette smoking is the independent variable of primary interest and the dependent variable is lifespan measured in years, researchers might include education and income as additional independent variables, to ensure that any observed effect of smoking on lifespan is not due to those other socio-economic factors. However, it is never possible to include all possible confounding variables in an empirical analysis. For example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, randomized controlled trials are often able to generate more compelling evidence of causal relationships than can be obtained using regression analyses of observational data. When controlled experiments are not feasible, variants of regression analysis such as instrumental variables regression may be used to attempt to estimate causal relationships from observational data.
The capital asset pricing model uses linear regression as well as the concept of beta for analyzing and quantifying the systematic risk of an investment. This comes directly from the beta coefficient of the linear regression model that relates the return on the investment to the return on all risky assets.
Linear regression finds application in a wide range of environmental science settings, such as land use,[28] infectious diseases,[29] and air pollution.[30] For example, linear regression can be used to predict the changing effects of car pollution.[31] One notable example of this application in infectious diseases is the flattening-the-curve strategy emphasized early in the COVID-19 pandemic, where public health officials combined sparse data on infected individuals with sophisticated models of disease transmission to characterize the spread of COVID-19.[32]
Linear regression is commonly used in building science field studies to derive characteristics of building occupants. In a thermal comfort field study, building scientists usually ask occupants for their thermal sensation votes, which range from −3 (feeling cold) through 0 (neutral) to +3 (feeling hot), and measure the occupants' surrounding temperature. A neutral or comfort temperature can be calculated from a linear regression between the thermal sensation vote and indoor temperature by setting the thermal sensation vote to zero. However, there has been a debate on the regression direction: regressing thermal sensation votes (y-axis) against indoor temperature (x-axis), or the opposite, regressing indoor temperature (y-axis) against thermal sensation votes (x-axis).[33]
Isaac Newton is credited with inventing "a certain technique known today as linear regression analysis" in his work on equinoxes in 1700, and he wrote down the first of the two normal equations of the ordinary least squares method.[35][36] Least squares linear regression, as a means of finding a good rough linear fit to a set of points, was performed by Legendre (1805) and Gauss (1809) for the prediction of planetary movement. Quetelet was responsible for making the procedure well known and for using it extensively in the social sciences.[37]
Freedman, David A. (2009). Statistical Models: Theory and Practice. Cambridge University Press. p. 26. A simple regression equation has on the right hand side an intercept and an explanatory variable with a slope coefficient. A multiple regression equation has two or more explanatory variables on the right hand side, each with its own slope coefficient.
Rencher, Alvin C.; Christensen, William F. (2012), "Chapter 10, Multivariate regression – Section 10.1, Introduction", Methods of Multivariate Analysis, Wiley Series in Probability and Statistics, vol. 709 (3rd ed.), John Wiley & Sons, p. 19, ISBN 978-1-118-39167-9, archived from the original on 2024-10-04, retrieved 2015-02-07.
Yan, Xin (2009), Linear Regression Analysis: Theory and Computing, World Scientific, pp. 1–2, ISBN 978-981-283-411-9, archived from the original on 2024-10-04, retrieved 2015-02-07, Regression analysis ... is probably one of the oldest topics in mathematical statistics dating back to about two hundred years ago. The earliest form of the linear regression was the least squares method, which was published by Legendre in 1805, and by Gauss in 1809 ... Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the sun.
Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the Lasso". Journal of the Royal Statistical Society, Series B. 58 (1): 267–288. doi:10.1111/j.2517-6161.1996.tb02080.x. JSTOR 2346178.
Hawkins, Douglas M. (1973). "On the Investigation of Alternative Regressions by Principal Component Analysis". Journal of the Royal Statistical Society, Series C. 22 (3): 275–286. doi:10.2307/2346776. JSTOR 2346776.
Jolliffe, Ian T. (1982). "A Note on the Use of Principal Components in Regression". Journal of the Royal Statistical Society, Series C. 31 (3): 300–303. doi:10.2307/2348005. JSTOR 2348005.
Brillinger, David R. (1977). "The Identification of a Particular Nonlinear Time Series System". Biometrika. 64 (3): 509–515. doi:10.1093/biomet/64.3.509. JSTOR 2345326.
Tsao, Min (2022). "Group least squares regression for linear models with strongly correlated predictor variables". Annals of the Institute of Statistical Mathematics. 75 (2): 233–250. arXiv:1804.02499. doi:10.1007/s10463-022-00841-7. S2CID 237396158.
Swindel, Benee F. (1981). "Geometry of Ridge Regression Illustrated". The American Statistician. 35 (1): 12–15. doi:10.2307/2683577. JSTOR 2683577.
Draper, Norman R.; van Nostrand, R. Craig (1979). "Ridge Regression and James-Stein Estimation: Review and Comments". Technometrics. 21 (4): 451–466. doi:10.2307/1268284. JSTOR 1268284.
Hoerl, Arthur E.; Kennard, Robert W.; Hoerl, Roger W. (1985). "Practical Use of Ridge Regression: A Challenge Met". Journal of the Royal Statistical Society, Series C. 34 (2): 114–120. JSTOR 2347363.
Narula, Subhash C.; Wellington, John F. (1982). "The Minimum Sum of Absolute Errors Regression: A State of the Art Survey". International Statistical Review. 50 (3): 317–326. doi:10.2307/1402501. JSTOR 1402501.
Goldstein, H. (1986). "Multilevel Mixed Linear Model Analysis Using Iterative Generalized Least Squares". Biometrika. 73 (1): 43–56. doi:10.1093/biomet/73.1.43. JSTOR 2336270.
Charles Darwin. The Variation of Animals and Plants under Domestication. (1868) (Chapter XIII describes what was known about reversion in Galton's time. Darwin uses the term "reversion".)
Draper, N. R.; Smith, H. (1998). Applied Regression Analysis (3rd ed.). John Wiley. ISBN 978-0-471-17082-2.
Francis Galton. "Regression Towards Mediocrity in Hereditary Stature," Journal of the Anthropological Institute, 15:246–263 (1886). (Facsimile at: [1], archived 2016-03-10 at the Wayback Machine)
Robert S. Pindyck and Daniel L. Rubinfeld (1998, 4th ed.). Econometric Models and Economic Forecasts, ch. 1 (Intro, including appendices on Σ operators & derivation of parameter est.) & Appendix 4.3 (mult. regression in matrix form).
National Physical Laboratory (1961). "Chapter 1: Linear Equations and Matrices: Direct Methods". Modern Computing Methods. Notes on Applied Science. Vol. 16 (2nd ed.). Her Majesty's Stationery Office.
Linear regression is a fundamental statistical method used to model the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables) by fitting a linear equation to observed data, where the best-fitting line is determined by minimizing the sum of squared differences between observed and predicted values, known as the least squares method.[1] This approach assumes a linear relationship, often expressed for simple linear regression as Y = a + bX + ϵ, where Y is the dependent variable, X is the independent variable, a is the y-intercept, b is the slope, and ϵ represents the error term.[2] The method extends to multiple linear regression, incorporating several predictors as Y = a + b1X1 + b2X2 + ⋯ + bnXn + ϵ.[1]

Key assumptions underlying linear regression include linearity in the parameters, independence of errors, homoscedasticity (constant variance of errors), and often normality of the error distribution for inference purposes, though the least squares estimation itself does not require normality.[3] These assumptions can be checked using diagnostic plots such as scatterplots for linearity and residual plots for homoscedasticity and normality.[1] Violations may necessitate data transformations or alternative models, but the technique remains robust for many applications due to the central limit theorem approximating normality in large samples.[2]

Linear regression serves multiple purposes, including describing relationships between variables, estimating unknown values of the dependent variable, and predicting future outcomes or prognostication, such as identifying risk factors in medical studies (e.g., predicting blood pressure from age and weight).[1] It is widely applied across fields like economics, biology, engineering, and social sciences for tasks ranging from forecasting sales based on advertising spend to analyzing the impact of environmental factors on crop yields.[2] The model's simplicity and interpretability—where coefficients directly indicate the change in the dependent variable per unit change in an independent variable—make it a cornerstone of statistical analysis.[1]

The origins of linear regression trace back to the development of the least squares method, first published by Adrien-Marie Legendre in 1805 for astronomical calculations, though Carl Friedrich Gauss claimed prior invention around 1795 and formalized its probabilistic justification in 1809.[4] The term "regression" was coined by Francis Galton in the 1880s during his studies on heredity using pea plant data, observing how offspring traits "regressed" toward the population mean, with Karl Pearson later refining the mathematical framework in the 1890s through correlation and multiple regression extensions.[5] This evolution transformed least squares from a computational tool into a cornerstone of modern inferential statistics.[4]
Fundamentals
Definition and Model Formulation
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. The model assumes that the conditional expectation of the response variable, given the predictors, can be expressed as a linear combination of the predictors. This approach is foundational in statistics and is widely applied in fields such as economics, biology, and engineering for predictive modeling and inference.[6]

In its general matrix formulation for multiple linear regression, the model is expressed as

Y = Xβ + ε,

where Y is an n×1 vector of observed responses, X is an n×p design matrix containing the predictor variables (with the first column typically being a vector of ones to account for the intercept), β is a p×1 vector of unknown regression coefficients, and ε is an n×1 vector of random error terms. The error terms satisfy E(ε) = 0 and Var(ε) = σ²In, where σ² is the error variance and In is the n×n identity matrix, implying that the errors have mean zero and are uncorrelated with constant variance.[7][8]

The linearity in the model refers specifically to the parameters β, meaning the response is a linear function of these coefficients, though the predictors in X may involve nonlinear transformations of the original variables (e.g., polynomials or interactions). The intercept term β0 is included as the first element of β, corresponding to the constant column in X, and represents the expected value of Y when all predictors are zero. This formulation allows the conditional expectation E(Y | X) = Xβ to serve as the mean response surface, derived from the zero mean of the errors: E(Y | X) = Xβ + E(ε | X) = Xβ, assuming the errors are independent of the predictors.[9][8][6]

For the simple case with a single predictor, the model simplifies to yi = β0 + β1xi + εi for i = 1, …, n, where yi is the i-th response, xi is the predictor value, β0 is the intercept, β1 is the slope coefficient, and εi is the error term with E(εi) = 0 and Var(εi) = σ². This setup captures the essence of the linear relationship, where the expected response E(yi | xi) = β0 + β1xi increases or decreases linearly with xi, depending on the sign of β1.[9][6]
Simple Linear Regression Example
To illustrate simple linear regression, consider a hypothetical dataset consisting of heights (in inches, denoted as X) and weights (in pounds, denoted as Y) for 10 adult males, drawn from a larger body measurements study.[10] The data are presented in the following table:
Individual | Height X (inches) | Weight Y (pounds)
1 | 67.75 | 154.25
2 | 72.25 | 173.25
3 | 66.25 | 154.00
4 | 72.25 | 184.75
5 | 71.25 | 184.25
6 | 74.75 | 210.25
7 | 69.75 | 181.00
8 | 72.50 | 176.00
9 | 74.00 | 191.00
10 | 73.50 | 198.25
A scatter plot of these points shows a clear positive linear trend, with weight increasing as height increases, though some variability exists around the trend line.[11]

The ordinary least squares method fits the model by estimating the slope β1 and intercept β0. The slope is given by

β1 = Cov(X, Y) / Var(X),

where Cov(X, Y) is the sample covariance and Var(X) is the sample variance of X.[12] Here, the sample means are x̄ = 71.425 inches and ȳ = 180.7 pounds, yielding Cov(X, Y) ≈ 43.07 and Var(X) ≈ 7.51, so β1 ≈ 5.73 pounds per inch. The intercept is then

β0 = ȳ − β1x̄ ≈ 180.7 − (5.73)(71.425) ≈ −228.6 pounds.

Thus, the fitted equation is ŷ = −228.6 + 5.73x.[12]

Using this equation, the predicted weight for a new individual with height x = 70 inches is ŷ = −228.6 + 5.73(70) ≈ 172.5 pounds.[11]

To assess the fit, residuals are computed as ei = yi − ŷi for each observation. For example, for the first individual (height 67.75 inches, weight 154.25 pounds), ŷ1 ≈ −228.6 + 5.73(67.75) ≈ 159.6 pounds, so e1 ≈ 154.25 − 159.6 = −5.35 pounds. Residuals quantify the model's prediction errors and play a key role in evaluating whether the data align with the assumptions of linearity and independence of errors.[12]
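The computations in this example can be reproduced with a few lines of code. The sketch below assumes numpy is available and uses the ten height/weight pairs from the table; variable names are illustrative.

```python
# Sketch: reproducing the height/weight example -- slope from the sample covariance
# and variance, intercept from the means, then a prediction and the first residual.
import numpy as np

height = np.array([67.75, 72.25, 66.25, 72.25, 71.25,
                   74.75, 69.75, 72.50, 74.00, 73.50])
weight = np.array([154.25, 173.25, 154.00, 184.75, 184.25,
                   210.25, 181.00, 176.00, 191.00, 198.25])

slope = np.cov(height, weight, ddof=1)[0, 1] / np.var(height, ddof=1)
intercept = weight.mean() - slope * height.mean()
print("slope, intercept:", round(slope, 2), round(intercept, 1))        # ~5.73, ~-228.6

print("predicted weight at 70 in:", round(intercept + slope * 70, 1))   # ~172.5
print("first residual:", round(weight[0] - (intercept + slope * height[0]), 2))  # ~-5.35
```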
Notation and Basic Interpretation
In linear regression, the dependent variable is typically denoted as y, representing the outcome or response of interest, while the independent variables are denoted as x1, x2, …, xp, where p is the number of predictors.[13] The model parameters include the coefficients β0, β1, …, βp, where β0 is the intercept and βj (for j = 1, …, p) are the slope coefficients associated with each predictor.[13] The error term's variance is denoted as σ², capturing the variability not explained by the predictors, and the sample size is n, the number of observations.

The coefficient βj is interpreted as the expected change in the dependent variable y for a one-unit increase in the independent variable xj, holding all other predictors constant.[14] This partial effect highlights the unique contribution of each predictor to the response in a multivariate setting.[14] The coefficient of determination, R², measures the proportion of the total variance in y that is explained by the model, ranging from 0 (no explanatory power) to 1 (perfect fit).[15]

The intercept β0 represents the expected value of y when all independent variables xj = 0 for j = 1, …, p.[14] This baseline value provides a reference point for the response under the condition of zero predictor values, though its practical relevance depends on whether such a scenario is meaningful in the data context.[14]

The magnitude and interpretation of coefficients are sensitive to the units of measurement for both y and the xj; for instance, re-expressing a predictor in centimeters rather than meters multiplies its values by 100 and rescales the corresponding βj by a factor of 100 (the coefficient per centimeter is one hundredth of the coefficient per meter), altering its numerical value while preserving the underlying relationship.[14] Standardization of variables can mitigate such scaling effects, yielding coefficients that reflect relative importance in standard deviation units.[16]
Assumptions and Limitations
Core Assumptions
The core assumptions underlying the linear regression model ensure that the ordinary least squares (OLS) estimator is unbiased, consistent, and efficient for inference and prediction. These assumptions, collectively known as the Gauss-Markov assumptions when excluding normality, form the foundation for the model's theoretical properties, including the BLUE (best linear unbiased estimator) status of OLS.[17] While some are required for estimation and others primarily for statistical inference, violations can compromise the reliability of coefficient estimates and hypothesis tests.[3]

The linearity assumption requires that the conditional expectation of the dependent variable Y given the predictors X is a linear function of the parameters:

E(Y | X) = Xβ,

where β is the vector of coefficients and X includes an intercept column. This implies that the model correctly captures the systematic relationship between predictors and the expected response, with deviations from this line attributed solely to random error.[17] In the standard formulation Y = Xβ + ε, linearity ensures that the error term ε has a conditional mean of zero, E(ε | X) = 0, preventing systematic bias in predictions.[3]

Independence of the errors assumes that the error terms εi for different observations i are independent, meaning Cov(εi, εj | X) = 0 for all i ≠ j. This condition, stronger than mere uncorrelatedness in the Gauss-Markov framework, is essential in cross-sectional or experimental data to ensure that observations do not influence one another through the errors, supporting the validity of standard error calculations and confidence intervals.[17]

Homoscedasticity requires that the conditional variance of the errors is constant across all levels of the predictors:

Var(εi | X) = σ² for all i,

where σ² is a positive constant. This equal spread of errors around the regression line ensures that the variance-covariance matrix of the errors is spherical (σ²I), which is crucial for the efficiency of OLS estimates and the reliability of t-tests and F-tests.[17] Without it, the precision of estimates would vary systematically with predictor values, leading to inefficient inference.[3]

Normality of the errors is assumed for valid statistical inference: εi | X ∼ N(0, σ²) independently for each i. This Gaussian distribution facilitates exact finite-sample inference, such as the t-distribution for coefficient significance and the F-distribution for overall model fit, particularly in small samples.[17] It is not required for the consistency or unbiasedness of the OLS point estimates, and it is not needed for large-sample asymptotics, where the central limit theorem provides approximate normality.[3]

Finally, the absence of perfect multicollinearity assumes that the predictor matrix X is of full column rank, ensuring (XᵀX) is invertible and that the coefficients β can be uniquely estimated. This prevents linear dependence among the predictors, which would otherwise make individual effects indistinguishable and render OLS undefined.[17] High but imperfect multicollinearity may inflate variances but does not violate this core requirement.[3]
Consequences of Violations
Violations of the core assumptions in linear regression can lead to biased estimates, inefficient predictions, and invalid statistical inferences, undermining the reliability of the model for both explanatory and predictive purposes.[3] Specifically, these breaches affect the ordinary least squares (OLS) estimator's properties, such as unbiasedness, efficiency, and the validity of standard errors, t-tests, and confidence intervals.[18] While OLS remains unbiased under many violations, the consequences often manifest in overstated precision or unreliable hypothesis testing, particularly in finite samples.[19]

Nonlinearity in the relationship between predictors and the response variable results in systematically biased predictions, as the linear model attempts to approximate a curved or nonadditive pattern with a straight line, leading to over- or under-predictions across parts of the data range.[20] This violation also tends to underestimate the true variance of the errors, inflating measures of model fit like R² and producing overly narrow confidence intervals that fail to capture the actual uncertainty.[3]

Heteroscedasticity, where error variance changes with the level of predictors, renders OLS estimates inefficient, meaning they have larger variance than the best linear unbiased estimator, though unbiasedness is preserved.[21] More critically, it invalidates the usual standard errors, leading to unreliable t-tests and F-tests that may incorrectly reject or fail to reject hypotheses; for instance, standard errors can be either underestimated or overestimated depending on the form of heteroscedasticity.[19] To address this for inference, heteroscedasticity-consistent standard errors, such as White's estimator, can provide robust alternatives without altering the point estimates.[22]

Autocorrelation in error terms, common in time series data, causes the OLS estimates to remain unbiased and consistent but inefficient, with underestimated standard errors that inflate the significance of coefficients and lead to overly optimistic t-statistics and p-values.[23] This serial correlation also artificially inflates the R² statistic, suggesting stronger explanatory power than actually exists, and renders confidence intervals too narrow, increasing the risk of Type I errors in hypothesis testing.[24]

Non-normality of errors does not bias OLS estimates or affect their consistency, but it compromises exact inference in small samples, where t- and F-distributions no longer hold, leading to inaccurate p-values and confidence intervals.[18] However, for large samples, the central limit theorem ensures asymptotic normality of the estimators, making inference robust; in small samples, severe non-normality can distort significance tests, though the impact diminishes with sample size exceeding 30–50 observations.[25]

In multiple linear regression, multicollinearity among predictors inflates the variance of the coefficient estimates, making them highly sensitive to small changes in data and leading to unstable predictions with wide confidence intervals, even if the overall model fit remains adequate.[26] This high variance does not bias the estimates but reduces their precision, often resulting in insignificant individual t-tests despite a significant overall F-test. The variance inflation factor (VIF), calculated as VIF_j = 1 / (1 − R_j²), where R_j² is the R² from regressing predictor j on the remaining predictors, quantifies this inflation, with VIF > 10 indicating problematic multicollinearity.[27]
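As a brief, hedged illustration of the robust-standard-error remedy mentioned above (a sketch with simulated heteroscedastic data; statsmodels' HC1 covariance option is one common implementation of White-type estimators):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 10, n)
# Error spread grows with x: heteroscedastic errors.
y = 1.0 + 0.5 * x + rng.normal(0, 0.2 + 0.3 * x, n)

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()                   # classical (non-robust) standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")  # heteroscedasticity-consistent (White-type)

print(ols.bse)      # usual standard errors, unreliable here
print(robust.bse)   # robust standard errors; point estimates are unchanged
print(np.allclose(ols.params, robust.params))   # True
```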
Diagnostic Techniques
Diagnostic techniques in linear regression involve graphical and statistical methods to evaluate the validity of model assumptions after estimation, such as linearity, homoscedasticity, normality, independence, and lack of multicollinearity. These tools primarily rely on residuals, defined as the differences between observed and fitted values, to identify potential violations that could lead to unreliable inferences. By examining residuals, analysts can detect patterns indicative of model misspecification or data issues, enabling informed decisions about model refinement.

Graphical methods are foundational for assessing several assumptions. A scatterplot of residuals against fitted values helps diagnose linearity and homoscedasticity; under ideal conditions, points should scatter randomly around zero without systematic trends or funneling patterns that suggest non-constant variance.[28] Similarly, a quantile-quantile (Q-Q) plot compares the distribution of residuals to a theoretical normal distribution, with points aligning closely to the reference line indicating approximate normality; deviations in the tails may signal outliers or non-normality.[28]

For detecting autocorrelation in residuals, particularly in time-series data, the Durbin-Watson test provides a statistical measure. The test statistic is calculated as

DW = Σ_{i=1}^{n−1} (e_{i+1} − e_i)² / Σ_{i=1}^{n} e_i²,

where e_i are the residuals and n is the sample size; values near 2 indicate no first-order autocorrelation, while values below 1.5 or above 2.5 suggest positive or negative autocorrelation, respectively, with critical values depending on the number of predictors and sample size.[29]

Multicollinearity among predictors is assessed using the variance inflation factor (VIF) for each regressor x_j, defined as

VIF_j = 1 / (1 − R_j²),

where R_j² is the coefficient of determination from regressing x_j on all other predictors; VIF values exceeding 5 or 10 typically indicate problematic multicollinearity, inflating coefficient variances and standard errors.

Influence and leverage diagnostics identify observations that disproportionately affect the fitted model. Leverage measures, derived from the hat matrix H = X(XᵀX)⁻¹Xᵀ, quantify how much each observation pulls the fit toward itself, with diagonal elements h_ii ranging from 0 to 1; high leverage (h_ii > 2p/n, where p is the number of parameters) warrants scrutiny. Cook's distance further combines leverage and residual size to measure overall influence:

D_i = [e_i² / (p · MSE)] · [h_ii / (1 − h_ii)²],

where e_i is the residual and MSE is the mean squared error; values exceeding 4/n, or larger than the 50th percentile of the F_{p, n−p} distribution (roughly 1), suggest influential points that may distort parameter estimates.

Outliers, which can bias results, are detected using studentized residuals, defined as t_i = e_i / √(MSE(1 − h_ii)), which adjust the ordinary residual for its estimated standard error. These follow a t-distribution under the model, allowing formal tests; absolute values greater than about 2.5 or 3 (depending on the significance level) flag potential outliers, as they deviate markedly from expected behavior.
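These diagnostics can be computed directly from a fitted model; the sketch below (simulated data, assuming statsmodels is available) gathers the Durbin-Watson statistic, VIFs, leverages, Cook's distances, and studentized residuals:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import (variance_inflation_factor,
                                                  OLSInfluence)

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)    # deliberately correlated with x1
y = 2 + 1.5 * x1 - 0.7 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

dw = durbin_watson(fit.resid)                    # ~2 means no first-order autocorrelation
vifs = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]

infl = OLSInfluence(fit)
leverage = infl.hat_matrix_diag                  # diagonal of the hat matrix
cooks_d = infl.cooks_distance[0]                 # Cook's distance per observation
stud_resid = infl.resid_studentized_internal     # studentized residuals

print(dw, vifs, leverage.max(), cooks_d.max(), np.abs(stud_resid).max())
```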
Estimation Methods
Ordinary Least Squares
Ordinary least squares (OLS) is the canonical method for estimating the parameters of a linear regression model by minimizing the sum of squared residuals between observed and predicted values. Introduced by Adrien-Marie Legendre in 1805 for determining comet orbits and independently justified probabilistically by Carl Friedrich Gauss in 1809, OLS selects the parameter vector β that best fits the data in a least-squares sense.[30][31] The residuals are the differences e_i = y_i − ŷ_i, and the objective function to minimize is the residual sum of squares

S(β) = (Y − Xβ)′(Y − Xβ),

where Y is the n×1 vector of observations, X is the n×(k+1) design matrix (including an intercept column of ones), and β is the (k+1)×1 parameter vector.[32]

To derive the OLS estimator, differentiate S(β) with respect to β and set the result to zero, yielding the normal equations

X′Xβ = X′Y.

This system of k+1 equations in k+1 unknowns arises from the first-order conditions for minimization. Assuming X′X is invertible (which requires the columns of X to be linearly independent, i.e., X to have full column rank), the closed-form solution is

β̂ = (X′X)⁻¹X′Y.

This explicit formula allows direct computation of the estimates without iterative methods, provided the matrix inversion is feasible.[32]

A property of the OLS estimator is that including an additional observation lying exactly on the original fitted hyperplane (such that its observed value equals the predicted value from the original estimates) does not change the OLS estimates. This holds because the residual for this new point is zero, leaving the sum of squared residuals minimized by the original solution. In contrast, if the new point deviates from the fitted values, the estimates necessarily change, as the least squares criterion identifies a unique minimizer under standard conditions.[33]

Under the core assumptions of the linear regression model (linearity in parameters, strict exogeneity of errors, no perfect multicollinearity, and homoskedasticity of errors), the OLS estimator possesses desirable statistical properties. It is unbiased, meaning E(β̂) = β, so on average the estimates equal the true parameters.[34] Furthermore, by the Gauss-Markov theorem, β̂ is the best linear unbiased estimator (BLUE), with the smallest variance among all linear unbiased estimators of β. The variance-covariance matrix of β̂ is

Var(β̂) = σ²(X′X)⁻¹,

where σ² is the variance of the error terms; this matrix quantifies the precision of the estimates, with diagonal elements giving the variances of the individual β̂_j.[34]

Once β̂ is obtained, the model generates fitted values Ŷ = Xβ̂, which serve as predictions for the response variable. The mean squared error (MSE) of these in-sample predictions, also known as the error variance estimate, is computed from the residuals as

MSE = (Y − Ŷ)′(Y − Ŷ) / (n − k − 1),

providing a measure of prediction accuracy and an unbiased estimate of σ².[35]
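The closed-form solution can be verified numerically; the following sketch (simulated data) computes β̂ from the normal equations and checks it against NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # design matrix with intercept
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Normal equations: (X'X) beta = X'y (solving the system is preferred over an explicit inverse).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with a numerically stable least-squares routine.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

resid = y - X @ beta_hat
mse = resid @ resid / (n - k - 1)   # unbiased estimate of the error variance
print(beta_hat, beta_lstsq, mse)
```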
Maximum Likelihood Estimation
In the linear regression model, maximum likelihood estimation (MLE) provides a probabilistic framework for parameter estimation by assuming that the errors follow a normal distribution, i.e., ϵ_i ~ N(0, σ²) independently for i = 1, …, n.[36] Under this assumption, the likelihood function for the parameters β (the vector of coefficients) and σ² (the error variance) is

L(β, σ²) = (2πσ²)^(−n/2) exp( −(1/(2σ²)) Σ_{i=1}^{n} (y_i − x_iᵀβ)² ),

where y_i is the observed response, x_i is the vector of predictors (including the intercept), and the sum of squared residuals measures the discrepancy between observed and predicted values.[37] This formulation treats the observation vector as a draw from N(Xβ, σ²I_n), with independent components, where X is the n×(p+1) design matrix.[38]

To find the MLE, it is computationally convenient to maximize the log-likelihood instead:

ℓ(β, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) ‖y − Xβ‖²,

where ‖·‖² denotes the squared Euclidean norm.[39] Differentiating ℓ with respect to β yields the score equations ∂ℓ/∂β = (1/σ²) Xᵀ(y − Xβ) = 0, which simplify to the normal equations XᵀXβ = Xᵀy under the assumption that XᵀX is invertible.[36] Solving these gives the MLE for β,

β̂_MLE = (XᵀX)⁻¹Xᵀy,

which coincides exactly with the ordinary least squares (OLS) estimator.[38] For σ², substituting β̂_MLE and differentiating ℓ with respect to σ² produces

σ̂²_MLE = (1/n) Σ_{i=1}^{n} ê_i² = ‖y − Xβ̂_MLE‖² / n,

where ê_i = y_i − x_iᵀβ̂_MLE are the residuals; this estimator is biased downward, in contrast to the unbiased OLS variance estimator s² = ‖y − Xβ̂_OLS‖² / (n − p − 1).[37]

Under standard regularity conditions (including normality of errors and a fixed, full-rank design matrix), the MLE β̂_MLE is consistent and asymptotically normal as n → ∞:

√n (β̂_MLE − β) →d N(0, σ² plim(n⁻¹XᵀX)⁻¹),

or, more precisely, β̂_MLE ~ N(β, σ²(XᵀX)⁻¹) in finite samples under exact normality.[40] These properties enable asymptotic inference on β. The Wald test assesses hypotheses of the form H₀: Rβ = r (where R is a restriction matrix and r a vector) by forming the quadratic form

W = (Rβ̂_MLE − r)ᵀ [R V̂ Rᵀ]⁻¹ (Rβ̂_MLE − r),

where V̂ = σ̂²(XᵀX)⁻¹ is the estimated covariance matrix of β̂_MLE; under H₀, W →d χ²_q as n → ∞, with q = rank(R).[41] Alternatively, the likelihood ratio test compares the log-likelihoods under the full and restricted models:

LR = 2[ℓ(β̂_MLE, σ̂²_MLE) − ℓ(β̂_R, σ̂²_R)],

which also follows χ²_q asymptotically under H₀.[41] Both tests rely on the information matrix equality, which equates the expected outer product of the scores to the negative expected Hessian of the log-likelihood, ensuring their validity in large samples.[40]
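A short numerical sketch (simulated data; the restriction tested is that the second slope is zero, chosen purely for illustration) shows the coincidence of the MLE and OLS coefficient estimates, the downward bias of σ̂²_MLE relative to s², and a likelihood ratio test:

```python
import numpy as np
from scipy import stats

def loglik(y, X, beta, sigma2):
    """Gaussian log-likelihood of a linear model."""
    r = y - X @ beta
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * (r @ r) / sigma2

rng = np.random.default_rng(3)
n, p = 120, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(size=n)

# The MLE of beta equals OLS; the variance estimates differ only in the divisor.
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
ssr = np.sum((y - X @ beta_hat) ** 2)
sigma2_mle, s2 = ssr / n, ssr / (n - p - 1)

# Likelihood ratio test of H0: beta_2 = 0 (restricted model drops the last column).
Xr = X[:, :2]
beta_r = np.linalg.lstsq(Xr, y, rcond=None)[0]
ssr_r = np.sum((y - Xr @ beta_r) ** 2)
lr = 2 * (loglik(y, X, beta_hat, sigma2_mle) - loglik(y, Xr, beta_r, ssr_r / n))
p_value = stats.chi2.sf(lr, df=1)

print(sigma2_mle, s2, lr, p_value)
```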
Alternative Estimators
Least absolute deviation (LAD) estimation provides a robust alternative to ordinary least squares (OLS) by minimizing the sum of absolute residuals rather than squared residuals, making it less sensitive to outliers.[42] The LAD estimator is defined as

β̂_LAD = argmin_β Σ_{i=1}^{n} |y_i − x_i′β|,

where y_i is the observed response, x_i are the predictors, and β are the coefficients. This objective function corresponds to the conditional median of the response given the predictors, which is inherently robust to extreme values since the median is less affected by outliers than the mean.[42] Computationally, the LAD problem can be reformulated as a linear programming task, allowing efficient solution via standard optimization algorithms such as the simplex method.[42]

Quantile regression extends the LAD approach to estimate conditional quantiles of the response distribution at any level τ ∈ (0, 1), offering a more complete picture of the relationship between predictors and the full range of response outcomes.[43] The estimator is given by

β̂(τ) = argmin_β Σ_{i=1}^{n} ρ_τ(y_i − x_i′β),

where the check function ρ_τ(u) = u(τ − I(u < 0)) weights positive and negative residuals asymmetrically based on τ, with I(·) as the indicator function.[43] When τ = 0.5, this reduces to median regression, recovering the LAD estimator.[43] Like LAD, quantile regression can be solved using linear programming, and it is particularly useful in heteroscedastic settings or when interest lies in tail behaviors of the response.[43]

Ridge regression addresses multicollinearity in the predictors by introducing a penalty term that shrinks the coefficients toward zero, stabilizing estimates when OLS variances become large.[44] The ridge estimator is obtained by minimizing a penalized sum of squares:

β̂_ridge = argmin_β ( ‖y − Xβ‖² + λ‖β‖² ),

where λ > 0 is a tuning parameter controlling the degree of shrinkage, y is the response vector, and X is the design matrix.[44] This can be derived from the normal equations of OLS, X′Xβ = X′y, by adding the ridge penalty λI to X′X, giving (X′X + λI)β = X′y and the closed-form solution β̂_ridge = (X′X + λI)⁻¹X′y.[44] The bias introduced by shrinkage is traded off against reduced variance, often improving mean squared error in correlated predictor scenarios.[44]

Bayesian linear regression incorporates prior beliefs about the coefficients, yielding estimators as posterior means under a normal prior, which provides a probabilistic framework for uncertainty quantification beyond point estimates. Assuming a normal likelihood for the errors and a normal prior β ~ N(μ₀, Σ₀), the posterior distribution is also normal, with mean

β̂_Bayes = (X′X + Σ₀⁻¹)⁻¹(X′y + Σ₀⁻¹μ₀),

resembling a shrunk OLS estimator in which the prior acts as regularization. This approach naturally handles multicollinearity through the prior covariance and allows for model comparison via posterior predictive checks.
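The ridge closed form is easy to implement directly; the sketch below (simulated, strongly correlated predictors; λ chosen arbitrarily for illustration, with the intercept handled by centering rather than penalization) compares ridge coefficients with OLS:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 80
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)        # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)

# Center y and X so the intercept need not be penalized.
Xc, yc = X - X.mean(axis=0), y - y.mean()

lam = 1.0                                       # ridge tuning parameter (illustrative)
p = Xc.shape[1]
beta_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

print(beta_ols)     # unstable under near-collinearity
print(beta_ridge)   # shrunk toward zero and much less variable across samples
```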
Extensions and Variants
Multiple Linear Regression
Multiple linear regression extends the simple linear regression framework to incorporate multiple predictor variables, allowing for the modeling of more complex relationships between the response variable and a set of explanatory factors. The model is expressed as

Y_i = β₀ + Σ_{j=1}^{p} β_j X_ij + ϵ_i, for i = 1, …, n,

where Y_i is the response, X_ij are the predictors, β_j are the coefficients, and ϵ_i are independent errors with mean zero and constant variance.[8] In matrix notation, this becomes Y = Xβ + ϵ, where Y is an n×1 vector of responses, X is an n×(p+1) design matrix with a column of ones for the intercept, β is a (p+1)×1 vector of coefficients, and ϵ is an n×1 error vector.[8] This formulation facilitates computational efficiency and theoretical analysis, particularly for estimation via ordinary least squares, where the coefficient vector is β̂ = (X⊤X)⁻¹X⊤Y, assuming X⊤X is invertible.[8]

The coefficients β_j in multiple linear regression represent partial effects, capturing the change in the expected value of Y associated with a one-unit increase in X_j, while holding all other predictors constant at their means or specific values.[45] Unlike marginal effects in simple regression (where p = 1), these partial coefficients adjust for confounding among predictors, isolating the unique contribution of each variable to the response.[46] This interpretation is crucial for causal inference in observational data but assumes the model is correctly specified and free of multicollinearity that could inflate standard errors.[45]

To assess overall model fit and significance, the coefficient of determination R² measures the proportion of variance explained by the predictors, but it increases with added variables regardless of their relevance.[47] The adjusted R² addresses this by penalizing model complexity:

R̄² = 1 − (1 − R²)(n − 1)/(n − p − 1),

where n is the sample size and p is the number of predictors; higher values indicate better fit relative to baseline models.[47] For overall significance, the F-test evaluates whether the model explains more variance than an intercept-only model:

F = [R²/p] / [(1 − R²)/(n − p − 1)],

which follows an F-distribution with p and n − p − 1 degrees of freedom under the null hypothesis that β_j = 0 for all j ≥ 1.[46] A significant F-statistic (low p-value) supports retaining the full model.

Variable selection in multiple linear regression aims to identify a subset of predictors that balances fit and parsimony, often using stepwise methods guided by information criteria. Forward stepwise selection begins with an intercept-only model and iteratively adds the predictor that most improves fit (e.g., via the largest reduction in residual sum of squares or the largest F-statistic), stopping when no addition yields a significant gain.[48] Backward stepwise selection starts with all predictors and removes the least contributory one step by step until further removals degrade fit unacceptably.[48] These procedures commonly employ the Akaike information criterion (AIC) for selection, defined as AIC = −2 log L + 2k, where L is the likelihood and k = p + 1 is the number of parameters; lower AIC values favor models with good predictive accuracy penalized for complexity.[49] While computationally efficient, stepwise methods risk overfitting and are sensitive to collinearity, prompting caution in interpretation.[48]
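The fit statistics above follow directly from the residual and total sums of squares; a brief sketch with simulated data (scipy is used only for the F-distribution p-value):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n, p = 150, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.5, 1.0, -2.0, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

ss_res = resid @ resid
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))
f_pvalue = stats.f.sf(f_stat, p, n - p - 1)      # overall significance test

print(r2, adj_r2, f_stat, f_pvalue)
```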
Generalized Linear Models
Generalized linear models (GLMs) extend the framework of linear regression to accommodate response variables that follow distributions other than the normal distribution, such as binomial or Poisson, by incorporating a link function that connects the mean of the response to a linear predictor.[50] The model consists of three main components: a random component specifying the distribution of the response variable Y, typically from the exponential family (e.g., Poisson for count data or binomial for binary/proportion data); a linear predictor η = Xβ, where X is the design matrix and β are the coefficients; and a link function g(μ) = η, where μ = E(Y) is the expected value of the response.[50] This structure allows GLMs to model non-constant variance and non-linear relationships between predictors and the mean response while maintaining the linear form in the predictors.[50]

A key feature of GLMs is the use of canonical link functions, which simplify estimation and interpretation by aligning the link with the natural parameter of the exponential family distribution.[50] For the Gaussian distribution, the canonical link is the identity function g(μ) = μ, which recovers the standard linear regression model.[50] In the case of the binomial distribution, the logit link g(μ) = log(μ/(1 − μ)) is canonical, suitable for modeling probabilities.[50] For the Poisson distribution, the log link g(μ) = log(μ) serves as the canonical form, enabling the analysis of count data with multiplicative effects.[50]

Parameter estimation in GLMs is typically performed using iteratively reweighted least squares (IRLS), an algorithm that iteratively solves weighted least squares problems to maximize the likelihood.[50] In each iteration, a working response is constructed as z_i = η_i + (y_i − μ_i)·g′(μ_i), where η_i is the current linear predictor, y_i is the observed response, μ_i is the current fitted mean, and g′(μ_i) is the derivative of the link function; the coefficients β are then updated by weighted ordinary least squares with weights w_i = 1/[V(μ_i)(g′(μ_i))²], where V(μ) is the variance function of the distribution.[50] This process converges to the maximum likelihood estimates under the specified distribution.[50]

Goodness-of-fit in GLMs is assessed using the deviance, defined as D = 2[ℓ(saturated) − ℓ(fitted)], where ℓ denotes the log-likelihood, analogous to the likelihood ratio statistic.[50] The deviance measures the discrepancy between the fitted model and a saturated model that perfectly fits the data, with smaller values indicating better fit; under the null hypothesis of adequate fit, it approximately follows a chi-squared distribution with degrees of freedom equal to the number of observations minus the number of parameters.[50]

To handle overdispersion, where the observed variance exceeds that implied by the model (e.g., ϕ > 1 for a dispersion parameter in quasi-likelihood approaches), GLMs can incorporate a scale parameter ϕ such that Var(Y) = ϕV(μ), allowing robust estimation without altering the mean structure.[51] This quasi-likelihood extension, introduced to address situations like clustered or heterogeneous data, scales the standard errors by √ϕ while preserving the IRLS estimation procedure.[51]
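The IRLS updates can be written out explicitly; the sketch below (simulated count data with a Poisson response and the canonical log link, for which g′(μ) = 1/μ and V(μ) = μ, so the weights simplify to w_i = μ_i) runs the iteration to convergence:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta_true))          # Poisson counts, log link

beta = np.zeros(X.shape[1])
beta[0] = np.log(y.mean() + 0.1)                # crude but safe starting value
for _ in range(50):
    eta = X @ beta                              # linear predictor
    mu = np.exp(eta)                            # inverse link
    # Working response: z = eta + (y - mu) * g'(mu), with g'(mu) = 1/mu.
    z = eta + (y - mu) / mu
    # Weights: w = 1 / [V(mu) * g'(mu)^2] = mu for the Poisson / log-link case.
    w = mu
    WX = X * w[:, None]
    beta_new = np.linalg.solve(X.T @ WX, WX.T @ z)   # weighted least squares update
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

print(beta)   # close to beta_true; matches the Poisson GLM maximum likelihood fit
```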
Robust and Regularized Variants
Robust and regularized variants of linear regression extend the classical model to address violations of core assumptions such as homoscedasticity, independence in clustered data, measurement errors in predictors, endogeneity due to correlation between regressors and errors, and challenges in high-dimensional settings where the number of predictors exceeds the number of observations. These methods improve estimation efficiency, reduce bias, and promote sparsity or interpretability while maintaining the linear structure. They are particularly valuable in applied fields like econometrics, social sciences, and machine learning, where real-world data often deviate from ideal conditions.

Heteroscedastic models arise when the variance of the errors is not constant across observations, violating the homoscedasticity assumption and leading to inefficient ordinary least squares estimates. Weighted least squares (WLS) addresses this by assigning weights inversely proportional to the error variances, yielding more efficient estimators under known or estimated heteroscedasticity. The WLS estimator is given by

β̂ = (XᵀWX)⁻¹XᵀWY,

where W is a diagonal matrix with entries w_i = 1/Var(y_i), and the variances may be estimated iteratively from the residuals of an initial ordinary least squares fit. This approach, originally formulated as a generalization of least squares for weighted observations, enhances precision in settings like cross-sectional economic data with varying scales.[52]

Hierarchical or multilevel models, also known as random effects models, accommodate data with nested structures, such as students within schools or repeated measures within individuals, where observations are not independent due to group-level variation. These models partition the error term into fixed effects common across groups and random effects specific to each group, allowing intercepts or slopes to vary hierarchically. A basic two-level model for grouped data is

y_ij = X_ij β + Z_ij u_j + ϵ_ij,

where i indexes observations within group j, u_j ~ N(0, Σ_u) captures random effects at the group level, and ϵ_ij ~ N(0, σ²) is the residual error, assumed independent within groups. Estimation typically uses maximum likelihood or restricted maximum likelihood, accounting for the covariance structure induced by the random effects. This framework, developed for educational and social research, properly handles intraclass correlation and provides more accurate inference for group-varying parameters.[53]

Errors-in-variables models occur when the predictors X contain measurement errors, causing ordinary least squares to produce biased and inconsistent estimates of β due to attenuation bias. Total least squares (TLS), a robust alternative, minimizes the perpendicular distances from data points to the fitted hyperplane, perturbing both X and Y to account for errors in all variables. The TLS solution involves the singular value decomposition of the augmented matrix [X Y], with β̂ derived from the right singular vector corresponding to its smallest singular value. This method, analyzed through numerical linear algebra, yields consistent estimators under classical error assumptions and is widely applied in calibration problems and approximate solutions to overdetermined systems.
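A minimal total least squares sketch (simulated data with noise in both the predictor and the response; the variables are centered so no intercept column is needed), based on the SVD of the augmented matrix, illustrates how TLS counteracts the attenuation that affects OLS:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x_true = rng.normal(size=n)
y_true = 2.0 * x_true                            # true slope of 2

x_obs = x_true + rng.normal(scale=0.3, size=n)   # measurement error in the predictor
y_obs = y_true + rng.normal(scale=0.3, size=n)   # noise in the response

# Center both variables so the fit passes through the origin of the centered data.
xc, yc = x_obs - x_obs.mean(), y_obs - y_obs.mean()

# OLS is attenuated toward zero when x is measured with error.
beta_ols = (xc @ yc) / (xc @ xc)

# TLS: use the right singular vector of [x  y] with the smallest singular value.
Z = np.column_stack([xc, yc])
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
v = Vt[-1]                                       # direction of smallest singular value
beta_tls = -v[0] / v[1]

print(beta_ols, beta_tls)   # beta_tls is closer to the true slope of 2
```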
For cases where measurement errors in X correlate with the true regressors, instrumental variables provide an adjustment by using exogenous instruments Z uncorrelated with the errors but correlated with X.[54]

Lasso regression introduces regularization to linear models, particularly useful in high-dimensional settings where p > n (the number of predictors exceeds the sample size), by shrinking coefficients toward zero and performing automatic variable selection through sparsity. The lasso estimator solves the penalized optimization problem

β̂_lasso = argmin_β ‖Y − Xβ‖₂² + λ‖β‖₁,

where λ ≥ 0 controls the strength of the L1 penalty on the absolute values of the coefficients, driving irrelevant predictors exactly to zero. This promotes parsimonious models with improved prediction accuracy and interpretability, outperforming ordinary least squares in scenarios with multicollinearity or irrelevant features, as demonstrated in simulation studies and real datasets such as gene expression analysis. The method combines the benefits of subset selection and ridge regression while ensuring computational tractability via convex optimization.[55]

Instrumental variables (IV) regression mitigates endogeneity, where regressors correlate with the error term due to omitted variables, simultaneity, or measurement error, leading to biased ordinary least squares estimates. IV uses external instruments Z that are correlated with the endogenous regressors but uncorrelated with the errors, enabling identification of causal effects. The simple IV estimator for a single endogenous regressor is

β̂_IV = (ZᵀX)⁻¹ZᵀY,

which can be viewed as a two-stage least squares procedure: first regress X on Z to obtain fitted values X̂, then regress Y on X̂. This approach, rooted in early econometric work on supply-demand systems, provides consistent estimates under valid instrument conditions and is foundational for causal inference in observational data, though it requires careful testing of instrument strength and validity.[56]
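A small simulation (a hypothetical data-generating process with one endogenous regressor and one valid instrument) illustrates the IV estimator and its equivalence to the two-stage procedure in the just-identified case:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
z = rng.normal(size=n)                        # instrument: relevant and exogenous
u = rng.normal(size=n)                        # structural error
x = 1.0 * z + 0.8 * u + rng.normal(size=n)    # regressor correlated with the error
y = 2.0 * x + u                               # true causal coefficient: 2

# OLS is biased upward because Cov(x, u) > 0.
beta_ols = (x @ y) / (x @ x)

# Simple IV estimator: (z'x)^{-1} z'y.
beta_iv = (z @ y) / (z @ x)

# Equivalent two-stage least squares: regress x on z, then y on the fitted x.
x_hat = z * (z @ x) / (z @ z)
beta_2sls = (x_hat @ y) / (x_hat @ x_hat)

print(beta_ols, beta_iv, beta_2sls)           # IV and 2SLS are ~2; OLS is not
```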
Applications
Trend Analysis and Prediction
Linear regression serves as a fundamental tool for trend analysis by fitting a straight line to time-series data, capturing the underlying linear pattern over time. The model is typically expressed as y_t = β₀ + β₁t + ε_t, where y_t is the observed value at time t, β₀ is the intercept, β₁ is the slope representing the trend rate, and ε_t is the error term assumed to be normally distributed with mean zero and constant variance.[57] In technical analysis, linear regression is used to project future stock prices based on historical trends.[58][59] This approach enables detrending, where the linear component is subtracted from the data to isolate random fluctuations or cyclical patterns for further analysis.[60]

For prediction, linear regression provides point forecasts by extrapolating the fitted trend line beyond the observed data range. To quantify uncertainty, prediction intervals are constructed around these forecasts, given by

ŷ_new ± t_{n−p−1, 1−α/2} · s · √(1 + x_new′(X′X)⁻¹x_new),

where ŷ_new is the predicted value, t_{n−p−1, 1−α/2} is the critical value from the t-distribution with n − p − 1 degrees of freedom, s is the residual standard error, p is the number of predictors, and x_new is the predictor vector for the new observation.[61] These intervals widen as extrapolation extends further from the data, reflecting increased uncertainty due to potential violations of model assumptions outside the fitted range.[61]

Confidence bands, in contrast, provide intervals for the mean trend function rather than individual predictions and are narrower, following a similar form but omitting the "+1" term inside the square root:

ŷ_new ± t_{n−p−1, 1−α/2} · s · √(x_new′(X′X)⁻¹x_new).

These bands parallel the trend line and are used to assess the reliability of the estimated mean trajectory, such as in visualizing long-term growth patterns.[62]

In economic forecasting, linear regression models are commonly applied to predict indicators like GDP growth, often after detrending to focus on deviations from steady-state paths. For instance, quarterly GDP data can be modeled with a linear trend in logged values to estimate growth rates, but prior checks for stationarity, using tests such as the Augmented Dickey-Fuller test, are essential to avoid spurious regressions from non-stationary series.[63] Non-stationarity, indicated by unit roots, implies that shocks have persistent effects, necessitating differencing or cointegration analysis before applying the model.[63]

Long-term predictions using linear regression face significant limitations when underlying assumptions drift, such as shifts in linearity due to structural economic changes or evolving relationships between variables. Violations of linearity or stationarity can lead to biased forecasts and unreliable extrapolation, as the model fails to capture nonlinear dynamics or persistent trends that emerge over extended horizons.[3] In such cases, the widening prediction intervals underscore the model's reduced accuracy for distant forecasts, emphasizing the need for periodic re-estimation.[64]
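The interval formulas can be applied directly to a fitted trend; the sketch below (simulated trend data; scipy supplies the t critical value) computes a 95% prediction interval and the narrower confidence interval for one future time point:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n = 60
t = np.arange(n, dtype=float)
y = 10.0 + 0.4 * t + rng.normal(scale=2.0, size=n)   # linear trend plus noise

X = np.column_stack([np.ones(n), t])
p = X.shape[1] - 1                                   # one predictor (the time index)
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
s = np.sqrt(resid @ resid / (n - p - 1))             # residual standard error

x_new = np.array([1.0, n + 5])                       # forecast 5 steps past the sample
y_hat = x_new @ beta
xtx_inv = np.linalg.inv(X.T @ X)
leverage_term = x_new @ xtx_inv @ x_new
t_crit = stats.t.ppf(0.975, n - p - 1)

pred_margin = t_crit * s * np.sqrt(1 + leverage_term)    # prediction interval half-width
conf_margin = t_crit * s * np.sqrt(leverage_term)        # confidence band half-width

print(y_hat, (y_hat - pred_margin, y_hat + pred_margin),
      (y_hat - conf_margin, y_hat + conf_margin))
```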
Use in Specific Disciplines
In epidemiology, linear regression is commonly employed to model dose-response relationships, such as regressing disease rates or health outcomes on exposure levels while adjusting for confounders like age, socioeconomic status, or environmental factors. For instance, studies have used linear and log-linear regression models to examine the impact of blood lead levels on children's IQ, controlling for variables including maternal IQ, education, and birth weight, revealing that a doubling of blood lead concentration is associated with a 2.6-point IQ decrement. Similarly, linear regression has been applied to assess lead exposure's effect on blood pressure, where log-linear models show a 1.0 mm Hg increase in systolic blood pressure per doubling of blood lead levels after confounder adjustment. These models enable public health policymakers to quantify risks and evaluate interventions, such as lead reduction programs that have yielded substantial economic benefits.[65][66]

In finance, linear regression underpins the Capital Asset Pricing Model (CAPM), where the beta coefficient measures an asset's systematic risk relative to the market. The beta is calculated as

β = Cov(R_i, R_m) / Var(R_m),

with R_i as the asset's return and R_m as the market return, and is obtained in practice by regressing excess asset returns on excess market returns. This approach allows investors to estimate expected returns and assess portfolio risk, forming a cornerstone for asset pricing and cost of capital determination since its formulation in equilibrium theory.[67]

In econometrics, linear regression models form the foundational starting point for analyzing economic data, estimating relationships between variables, testing economic theories, and forecasting outcomes such as GDP growth.[68][69] Economists utilize linear regression in regression discontinuity designs to infer causal effects from policy interventions at known cutoff thresholds, comparing outcomes immediately above and below the cutoff to isolate treatment impacts. For example, analyses of vote share cutoffs near zero have estimated incumbency advantages, showing discontinuities of 5-10% in vote shares and 37-57% in reelection probabilities. Other applications include evaluating class size reductions at enrollment thresholds, where regression models reveal effects on student test scores, and financial aid eligibility based on test scores, demonstrating boosts in college enrollment. These designs provide quasi-experimental rigor for assessing policy efficacy, such as in education and electoral systems, under the assumption of continuity in potential outcomes absent the cutoff.[70]

In environmental science, linear regression models relate pollutant concentrations to covariates like land use, traffic density, and meteorological factors, including temperature, to map spatial and temporal variations. Land use regression, a form of multiple linear regression, has been used globally to predict nitrogen dioxide (NO₂) levels from variables such as proximity to major roads and satellite-derived emissions, explaining up to 54% of variation in annual concentrations across diverse regions. Temperature serves as a key covariate in such models, as higher values can exacerbate ozone formation or alter particulate matter (PM₂.₅) components like nitrate and sulfate through photochemical reactions and atmospheric stability changes.
These applications support air quality forecasting and regulatory assessments, such as identifying emission hotspots influenced by seasonal temperature shifts.[71][72]

Building science leverages linear regression to predict energy consumption based on insulation properties and related variables, aiding in the design of efficient structures. Models regress energy use intensity on factors like wall and roof U-values (measures of insulation effectiveness), infiltration rates, and window-to-wall ratios, validated against real-world data with errors under 10%. This approach integrates with building information modeling to optimize thermal performance and support green certification processes.[73]
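As a brief illustration of the CAPM beta described earlier in this subsection (a sketch with simulated excess returns; real applications would use historical return series), the regression slope coincides with the covariance-over-variance formula:

```python
import numpy as np

rng = np.random.default_rng(12)
n = 250                                        # roughly one year of trading days
r_market = rng.normal(0.0004, 0.01, n)         # simulated excess market returns
r_asset = 1.3 * r_market + rng.normal(0, 0.008, n)   # asset with a true beta of 1.3

# Beta as the slope of the excess-return regression...
beta_reg = np.polyfit(r_market, r_asset, 1)[0]

# ...equals Cov(R_i, R_m) / Var(R_m) (using matching degrees of freedom).
beta_cov = np.cov(r_asset, r_market, ddof=1)[0, 1] / np.var(r_market, ddof=1)

print(beta_reg, beta_cov)   # both approximately 1.3
```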
Integration with Machine Learning
In machine learning pipelines, linear regression often begins with feature engineering to ensure model stability and interpretability. A key step is standardization, where each feature x_j is transformed to x_j′ = (x_j − μ_j)/σ_j, centering the data at zero mean and unit variance. This preprocessing makes coefficients comparable across features with different scales, preventing those with larger variances from disproportionately influencing the fit, and is particularly beneficial for gradient-based optimization in linear models.[74]

Popular machine learning libraries integrate linear regression as a core component, facilitating seamless use within broader workflows. For instance, scikit-learn's LinearRegression class implements ordinary least squares fitting and supports pipeline integration for tasks like prediction and evaluation. Regularized variants, such as ridge and lasso regression, incorporate cross-validation to tune the regularization parameter λ; RidgeCV and LassoCV automate this by evaluating multiple λ values via k-fold cross-validation, selecting the one minimizing validation error to balance bias and variance. Lasso, in particular, promotes sparsity by driving some coefficients to zero, aiding feature selection in high-dimensional data.[75][76]

To capture nonlinearity while retaining the simplicity of linear regression, basis expansions transform the input space. Polynomial features extend the model to forms like y = β₀ + β₁x + β₂x² + ε, where higher-degree terms are generated via preprocessing (e.g., scikit-learn's PolynomialFeatures), allowing the linear framework to approximate curved relationships without altering the core algorithm. Splines provide a flexible alternative, using piecewise polynomials joined smoothly at knots to avoid the high-degree oscillations of global polynomials, as detailed in foundational statistical learning texts.[77]

Linear regression also serves as a base learner in ensemble methods, enhancing predictive performance through iterative refinement. In gradient boosting, weak models are sequentially fitted to the residuals of prior fits, building an additive ensemble that corrects errors step by step; although trees are the most common base learners, linear bases can be used when interpretability is preferred over complexity.[78]

A primary advantage of linear regression in machine learning is its inherent interpretability compared to black-box models like deep neural networks. The coefficients β directly quantify feature impacts on the target; for deeper insights, SHAP values decompose predictions into additive feature attributions, aligning with game-theoretic fairness axioms and extending interpretability to ensemble or regularized linear models.
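A compact scikit-learn sketch ties these pieces together (synthetic data; the pipeline combines standardization, polynomial basis expansion, and cross-validated lasso, with all settings purely illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
n = 400
X = rng.uniform(-2, 2, size=(n, 3))
# Nonlinear signal in the first feature; the third feature is irrelevant.
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(0, 0.3, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(
    StandardScaler(),                                   # put features on a common scale
    PolynomialFeatures(degree=2, include_bias=False),   # basis expansion
    LassoCV(cv=5),                                      # L1 penalty tuned by cross-validation
)
model.fit(X_train, y_train)

print(model.score(X_test, y_test))           # held-out R^2
print(model.named_steps["lassocv"].alpha_)   # selected regularization strength
print(model.named_steps["lassocv"].coef_)    # sparse coefficient vector
```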
Historical Development
Early Origins
The foundations of linear regression trace back to the early 19th century, when astronomers sought precise methods to fit observational data to theoretical models, particularly for predicting celestial orbits. In 1805, French mathematician Adrien-Marie Legendre introduced the method of least squares in the appendix of his work Nouvelles méthodes pour la détermination des orbites des comètes, marking the first formal publication of this technique.[79][80] Legendre presented it as a practical tool for minimizing the sum of squared residuals between observed and predicted positions of comets, enabling more accurate orbital determinations amid noisy astronomical measurements.[79] This approach, though algorithmic, laid the groundwork for regression by emphasizing error minimization in linear relationships.[79]

Shortly thereafter, in 1809, Carl Friedrich Gauss published his own formulation of the least squares method in the second volume of Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium, applying it to astronomical data such as planetary and asteroid orbits.[81][82] Gauss claimed to have developed the technique as early as 1795, using it successfully to predict the position of the asteroid Ceres based on the limited observations made by Giuseppe Piazzi in 1801.[81] His work extended Legendre's deterministic method by incorporating a probabilistic framework, assuming that observational errors followed a normal distribution with mean zero, which justified least squares as the maximum-likelihood estimator for parameter recovery.[83][84] This error model, derived from the idea that errors arise from numerous small, independent causes, provided a theoretical basis for the normality assumption in regression analysis.[83]

The term "regression" itself emerged later in the 19th century through studies of biological inheritance. In 1886, British polymath Francis Galton coined the phrase in his paper "Regression Towards Mediocrity in Hereditary Stature," published in the Journal of the Anthropological Institute of Great Britain and Ireland.[85] Analyzing data on the heights of 930 adult children and their 205 mid-parentages, Galton observed a tendency for offspring of exceptionally tall or short parents to have heights closer to the population average, a phenomenon he termed "regression towards mediocrity."[85] This empirical insight, visualized through scatter plots and fitted lines, highlighted mean reversion in linear relationships and connected least squares estimation to the study of correlated traits, influencing the statistical interpretation of regression.[85]
Key Milestones and Contributors
Building on Galton's empirical observations, Karl Pearson refined the mathematical framework of regression in the 1890s. In 1895, he developed the product-moment correlation coefficient to quantify the strength of linear relationships between variables, and by the early 1900s he extended these ideas to multiple regression, providing equations for predicting a dependent variable from several independent ones and establishing a rigorous basis for biometrical analysis.[5]

In the 1920s, Ronald Fisher laid foundational groundwork for modern linear regression analysis by developing the analysis of variance (ANOVA) framework, which decomposes total variance into components attributable to different sources, and introducing the F-test for assessing the significance of regression coefficients. These concepts were detailed in his seminal book Statistical Methods for Research Workers (1925), enabling researchers to evaluate the fit and explanatory power of linear models in experimental data.[86]

During the 1930s, Jerzy Neyman and Egon Pearson advanced the integration of hypothesis testing into linear regression by formulating the Neyman-Pearson lemma, which identifies the most powerful tests for simple hypotheses based on likelihood ratios, thereby providing a rigorous basis for inference on model parameters such as slopes and intercepts. Their collaborative work, including the 1933 paper "On the Problem of the Most Efficient Tests of Statistical Hypotheses," established a decision-theoretic approach that complemented Fisher's methods and became essential for validating linear regression assumptions.[87]

In the 1940s, Trygve Haavelmo propelled linear regression's application in econometrics by addressing biases in systems of simultaneous equations, where endogenous variables violate classical assumptions; his analysis demonstrated the need for identification strategies, paving the way for instrumental variables (IV) methods to obtain consistent estimates. This insight was articulated in his 1943 paper "The Statistical Implications of a System of Simultaneous Equations," which shifted econometric modeling toward probabilistic frameworks and influenced robust regression techniques.[88]

Computational advancements in the 1960s enhanced the stability of linear regression estimation amid growing data volumes; Gene Golub's 1965 paper "Numerical Methods for Solving Linear Least Squares Problems" introduced the QR decomposition using Householder transformations, offering a numerically stable alternative to direct solution of the normal equations by orthogonalizing the design matrix and avoiding ill-conditioning issues.[89]

Software developments in the 1970s democratized linear regression by embedding it in user-friendly tools; the Statistical Analysis System (SAS), originating from a 1966 project at North Carolina State University and incorporated in 1976, provided comprehensive procedures for regression modeling, including diagnostics and extensions, making advanced analysis accessible beyond specialists.[90]

By the 1990s, the open-source R programming language, initiated in 1993 by Ross Ihaka and Robert Gentleman at the University of Auckland as an implementation of the S language, further popularized linear regression through its lm() function and extensible packages, fostering widespread adoption in statistical computing and reproducible research.[91]