Ordinary least squares
from Wikipedia
Okun's law in macroeconomics states that in an economy the GDP growth should depend linearly on the changes in the unemployment rate. Here the ordinary least squares method is used to construct the regression line describing this law.

In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model (with fixed level-one effects of a linear function of a set of explanatory variables) by the principle of least squares: minimizing the sum of the squares of the differences between the observed values of the dependent variable in the input dataset and the output of the (linear) function of the independent variables. Some sources consider OLS to be synonymous with linear regression.[1]

Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface—the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula, especially in the case of a simple linear regression, in which there is a single regressor on the right side of the regression equation.

The OLS estimator is consistent for the level-one fixed effects when the regressors are exogenous and there is no perfect collinearity (the rank condition holds), and consistent for the variance estimate of the residuals when the regressors have finite fourth moments.[2] By the Gauss–Markov theorem, it is optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors are normally distributed with zero mean, OLS is the maximum likelihood estimator and outperforms any non-linear unbiased estimator.

Linear model

Suppose the data consists of $n$ observations $\{x_i, y_i\}_{i=1}^{n}$. Each observation $i$ includes a scalar response $y_i$ and a column vector $x_i$ of $p$ regressors, i.e., $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})^{\mathsf{T}}$. In a linear regression model, the response variable, $y_i$, is a linear function of the regressors:

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,$$

or in vector form,

$$y_i = x_i^{\mathsf{T}} \beta + \varepsilon_i,$$

where $x_i$, as introduced previously, is a column vector of the $i$-th observation of all the explanatory variables; $\beta$ is a $p \times 1$ vector of unknown parameters; and the scalar $\varepsilon_i$ represents unobserved random variables (errors) of the $i$-th observation. $\varepsilon_i$ accounts for the influences upon the responses $y_i$ from sources other than the explanatory variables $x_i$. This model can also be written in matrix notation as

$$y = X\beta + \varepsilon,$$

where $y$ and $\varepsilon$ are $n \times 1$ vectors of the response variables and the errors of the $n$ observations, and $X$ is an $n \times p$ matrix of regressors, also sometimes called the design matrix, whose row $i$ is $x_i^{\mathsf{T}}$ and contains the $i$-th observations on all the explanatory variables.

Typically, a constant term is included in the set of regressors $X$, say, by taking $x_{i1} = 1$ for all $i = 1, \dots, n$. The coefficient $\beta_1$ corresponding to this regressor is called the intercept. Without the intercept, the fitted line is forced to cross the origin when $x_i = 0$.

Regressors do not have to be independent of one another for estimation to be consistent; for example, they may be non-linearly dependent. Short of perfect multicollinearity, parameter estimates may still be consistent; however, as multicollinearity rises, the standard errors around such estimates increase and their precision falls. When there is perfect multicollinearity, it is no longer possible to obtain unique estimates for the coefficients of the related regressors; estimation for these parameters cannot converge (and thus cannot be consistent).

As a concrete example where regressors are non-linearly dependent yet estimation may still be consistent, suppose we suspect the response depends linearly both on a value and on its square; in that case we would include one regressor whose value is the square of another regressor. The model would then be quadratic in the second regressor, but it is nonetheless still considered a linear model because the model remains linear in the parameters ($\beta$).
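As a minimal illustration of this point, the sketch below (using NumPy; the data and coefficient values are invented for the example) builds a design matrix containing a constant, a regressor, and its square, and fits the coefficients with a single linear solve: quadratic in the regressor, but linear in the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented example data: y depends on x and x**2 plus noise.
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.3, size=x.size)

# Design matrix with a constant, x, and x**2: the model is quadratic
# in x but still linear in the parameters (beta_0, beta_1, beta_2).
X = np.column_stack([np.ones_like(x), x, x**2])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # should be close to [1.0, 0.5, 2.0]
```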

Matrix/vector formulation

Consider an overdetermined system

$$\sum_{j=1}^{p} X_{ij}\beta_j = y_i, \qquad i = 1, \dots, n,$$

of $n$ linear equations in $p$ unknown coefficients, $\beta_1, \beta_2, \dots, \beta_p$, with $n > p$. This can be written in matrix form as

$$X\beta = y,$$

where

$$X = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.$$

(Note: for a linear model as above, not all elements of $X$ contain information on the data points. The first column is populated with ones, $X_{i1} = 1$; only the other columns contain actual data, so here $p$ is equal to the number of regressors plus one.)

Such a system usually has no exact solution, so the goal is instead to find the coefficients $\beta$ which fit the equations "best", in the sense of solving the quadratic minimization problem

$$\hat{\beta} = \arg\min_{\beta} S(\beta),$$

where the objective function $S$ is given by

$$S(\beta) = \sum_{i=1}^{n} \left| y_i - \sum_{j=1}^{p} X_{ij}\beta_j \right|^2 = \left\| y - X\beta \right\|^2.$$

A justification for choosing this criterion is given in Properties below. This minimization problem has a unique solution, provided that the $p$ columns of the matrix $X$ are linearly independent, given by solving the so-called normal equations:

$$(X^{\mathsf{T}} X)\,\hat{\beta} = X^{\mathsf{T}} y.$$

The matrix $X^{\mathsf{T}} X$ is known as the normal matrix or Gram matrix, and the matrix $X^{\mathsf{T}} y$ is known as the moment matrix of regressand by regressors.[3] Finally, $\hat{\beta}$ is the coefficient vector of the least-squares hyperplane, expressed as

$$\hat{\beta} = (X^{\mathsf{T}} X)^{-1} X^{\mathsf{T}} y,$$

or, equivalently, $\hat{\beta} = \beta + (X^{\mathsf{T}} X)^{-1} X^{\mathsf{T}} \varepsilon.$

Estimation

Suppose $b$ is a "candidate" value for the parameter vector $\beta$. The quantity $y_i - x_i^{\mathsf{T}} b$, called the residual for the $i$-th observation, measures the vertical distance between the data point $(x_i, y_i)$ and the hyperplane $y = x^{\mathsf{T}} b$, and thus assesses the degree of fit between the actual data and the model. The sum of squared residuals (SSR) (also called the error sum of squares (ESS) or residual sum of squares (RSS))[4] is a measure of the overall model fit:

$$S(b) = \sum_{i=1}^{n} (y_i - x_i^{\mathsf{T}} b)^2 = (y - Xb)^{\mathsf{T}} (y - Xb),$$

where $^{\mathsf{T}}$ denotes the matrix transpose, and the rows of $X$, denoting the values of all the independent variables associated with a particular value of the dependent variable, are $X_i = x_i^{\mathsf{T}}$. The value of $b$ which minimizes this sum is called the OLS estimator for $\beta$. The function $S(b)$ is quadratic in $b$ with positive-definite Hessian, and therefore this function possesses a unique global minimum at $b = \hat{\beta}$, which can be given by the explicit formula[5][proof]

$$\hat{\beta} = \left( X^{\mathsf{T}} X \right)^{-1} X^{\mathsf{T}} y.$$

The product $N = X^{\mathsf{T}} X$ is a Gram matrix, and its inverse, $Q = N^{-1}$, is the cofactor matrix of $\beta$,[6][7][8] closely related to its covariance matrix, $C_\beta$. The matrix $(X^{\mathsf{T}} X)^{-1} X^{\mathsf{T}} = Q X^{\mathsf{T}}$ is called the Moore–Penrose pseudoinverse matrix of $X$. This formulation highlights the point that estimation can be carried out if, and only if, there is no perfect multicollinearity between the explanatory variables (which would cause the Gram matrix to have no inverse).
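As a sketch of this computation (NumPy, with synthetic data invented for the example), the estimator can be obtained either from the explicit formula or, more robustly, from a general least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: n observations, p regressors (first column is the intercept).
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Explicit formula: beta_hat = (X'X)^{-1} X'y
beta_explicit = np.linalg.inv(X.T @ X) @ X.T @ y

# Equivalent, numerically preferable: least-squares solver (SVD-based)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_explicit, beta_lstsq))  # True (up to floating-point error)
```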

Prediction

After we have estimated $\beta$, the fitted values (or predicted values) from the regression will be

$$\hat{y} = X\hat{\beta} = Py,$$

where $P = X(X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}$ is the projection matrix onto the space $V$ spanned by the columns of $X$. This matrix $P$ is also sometimes called the hat matrix because it "puts a hat" onto the variable $y$. Another matrix, closely related to $P$, is the annihilator matrix $M = I_n - P$; this is a projection matrix onto the space orthogonal to $V$. Both matrices $P$ and $M$ are symmetric and idempotent (meaning that $P^2 = P$ and $M^2 = M$), and relate to the data matrix $X$ via the identities $PX = X$ and $MX = 0$.[9] The matrix $M$ creates the residuals from the regression:

$$\hat{\varepsilon} = y - \hat{y} = y - X\hat{\beta} = My = M(X\beta + \varepsilon) = M\varepsilon.$$
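A small numerical check of these identities (NumPy, reusing a synthetic design matrix invented for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.2, size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # hat (projection) matrix
M = np.eye(n) - P                      # annihilator matrix

print(np.allclose(P @ P, P), np.allclose(M @ M, M))   # idempotent
print(np.allclose(P, P.T), np.allclose(M, M.T))       # symmetric
print(np.allclose(P @ X, X), np.allclose(M @ X, 0))   # PX = X, MX = 0

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(M @ y, y - X @ beta_hat))            # residuals = My
```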

The variances of the predicted values are found in the main diagonal of the variance-covariance matrix of predicted values:

$$\widehat{\operatorname{Var}}(\hat{y}) = s^2 P,$$

where $P$ is the projection matrix and $s^2$ is the sample variance.[10] The full matrix is very large; its diagonal elements can be calculated individually as:

$$\widehat{\operatorname{Var}}(\hat{y}_i) = s^2\, X_i (X^{\mathsf{T}}X)^{-1} X_i^{\mathsf{T}},$$

where $X_i$ is the $i$-th row of the matrix $X$.

Sample statistics

Using these residuals we can estimate the sample variance $s^2$ using the reduced chi-squared statistic:

$$s^2 = \frac{\hat{\varepsilon}^{\mathsf{T}}\hat{\varepsilon}}{n - p} = \frac{y^{\mathsf{T}} M y}{n - p} = \frac{S(\hat{\beta})}{n - p}, \qquad \hat{\sigma}^2 = \frac{n - p}{n}\, s^2.$$

The denominator, $n - p$, is the statistical degrees of freedom. The first quantity, $s^2$, is the OLS estimate for $\sigma^2$, whereas the second, $\hat{\sigma}^2$, is the MLE estimate for $\sigma^2$. The two estimators are quite similar in large samples; the first estimator is always unbiased, while the second estimator is biased but has a smaller mean squared error. In practice $s^2$ is used more often, since it is more convenient for hypothesis testing. The square root of $s^2$ is called the regression standard error,[11] standard error of the regression,[12][13] or standard error of the equation.[9]

It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto $X$. The coefficient of determination $R^2$ is defined as a ratio of "explained" variance to the "total" variance of the dependent variable $y$, in the cases where the regression sum of squares equals the sum of squares of residuals:[14]

$$R^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} = 1 - \frac{y^{\mathsf{T}} M y}{y^{\mathsf{T}} L y} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},$$

where TSS is the total sum of squares for the dependent variable, $L = I_n - \frac{1}{n} J_n$, and $J_n$ is an $n \times n$ matrix of ones. ($L$ is a centering matrix which is equivalent to regression on a constant; it simply subtracts the mean from a variable.) In order for $R^2$ to be meaningful, the matrix $X$ of data on regressors must contain a column vector of ones to represent the constant whose coefficient is the regression intercept. In that case, $R^2$ will always be a number between 0 and 1, with values close to 1 indicating a good degree of fit.
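A sketch of these sample statistics (NumPy, synthetic data invented for the example):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.8, -1.2]) + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

s2 = resid @ resid / (n - p)          # unbiased OLS estimate of sigma^2
sigma2_mle = resid @ resid / n        # biased MLE estimate

tss = np.sum((y - y.mean()) ** 2)     # total sum of squares (X includes a constant)
r2 = 1.0 - resid @ resid / tss        # coefficient of determination

print(s2, sigma2_mle, r2)
```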

Simple linear regression model

If the data matrix $X$ contains only two variables, a constant and a scalar regressor $x_i$, then this is called the "simple regression model". This case is often considered in beginner statistics classes, as it provides much simpler formulas, even suitable for manual calculation. The parameters are commonly denoted as $(\alpha, \beta)$:

$$y_i = \alpha + \beta x_i + \varepsilon_i.$$

The least squares estimates in this case are given by the simple formulas

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}.$$
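These two formulas translate directly into code; a minimal sketch (NumPy, with made-up data), cross-checked against a general polynomial fit:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=60)
y = 3.0 + 1.5 * x + rng.normal(scale=1.0, size=x.size)

beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# Cross-check with a degree-1 polynomial fit (returns [slope, intercept]).
slope, intercept = np.polyfit(x, y, 1)
print(np.allclose([beta_hat, alpha_hat], [slope, intercept]))  # True
```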

Alternative derivations

In the previous section the least squares estimator was obtained as a value that minimizes the sum of squared residuals of the model. However it is also possible to derive the same estimator from other approaches. In all cases the formula for OLS estimator remains the same: ^β = (XTX)−1XTy; the only difference is in how we interpret this result.

Projection

OLS estimation can be viewed as a projection onto the linear space spanned by the regressors. (Here each of $X_1$ and $X_2$ refers to a column of the data matrix.)

For mathematicians, OLS is an approximate solution to an overdetermined system of linear equations $X\beta \approx y$, where $\beta$ is the unknown. Assuming the system cannot be solved exactly (the number of equations $n$ being much larger than the number of unknowns $p$), we are looking for a solution that could provide the smallest discrepancy between the right- and left-hand sides. In other words, we are looking for the solution that satisfies

$$\hat{\beta} = \arg\min_{\beta} \left\| y - X\beta \right\|,$$

where $\|\cdot\|$ is the standard $L^2$ norm in the $n$-dimensional Euclidean space $\mathbb{R}^n$. The predicted quantity $X\beta$ is just a certain linear combination of the vectors of regressors. Thus, the residual vector $y - X\beta$ will have the smallest length when $y$ is projected orthogonally onto the linear subspace spanned by the columns of $X$. The OLS estimator $\hat{\beta}$ in this case can be interpreted as the coefficients of the vector decomposition of $\hat{y} = Py$ along the basis of $X$.

In other words, the gradient equations at the minimum can be written as:

$$\left( \frac{\partial\, \|y - X\beta\|^2}{\partial \beta} \right)_{\beta = \hat{\beta}} = -2\, X^{\mathsf{T}} \big( y - X\hat{\beta} \big) = 0.$$

A geometrical interpretation of these equations is that the vector of residuals, $y - X\hat{\beta}$, is orthogonal to the column space of $X$, since the dot product $(y - X\hat{\beta}) \cdot Xv$ is equal to zero for any conformal vector, $v$. This means that $y - X\hat{\beta}$ is the shortest of all possible vectors $y - X\beta$, that is, the variance of the residuals is the minimum possible. This is illustrated at the right.
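A quick numerical check of this orthogonality condition (NumPy, synthetic data invented for the example):

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([0.5, 2.0, -1.0]) + rng.normal(scale=0.3, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# The residual vector is orthogonal to every column of X (gradient equations).
print(np.allclose(X.T @ resid, 0.0, atol=1e-8))  # True
```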

Introducing and a matrix K with the assumption that a matrix is non-singular and KT X = 0 (cf. Orthogonal projections), the residual vector should satisfy the following equation:

The equation and solution of linear least squares are thus described as follows:

Another way of looking at it is to consider the regression line to be a weighted average of the lines passing through the combination of any two points in the dataset.[15] Although this way of calculation is more computationally expensive, it provides a better intuition on OLS.

Maximum likelihood

The OLS estimator is identical to the maximum likelihood estimator (MLE) under the normality assumption for the error terms.[16][proof] This normality assumption has historical importance, as it provided the basis for the early work in linear regression analysis by Yule and Pearson.[citation needed] From the properties of MLE, we can infer that the OLS estimator is asymptotically efficient (in the sense of attaining the Cramér–Rao bound for variance) if the normality assumption is satisfied.[17]

Generalized method of moments

In the i.i.d. case the OLS estimator can also be viewed as a GMM estimator arising from the moment conditions

$$\mathrm{E}\big[\, x_i \big( y_i - x_i^{\mathsf{T}} \beta \big) \big] = 0.$$

These moment conditions state that the regressors should be uncorrelated with the errors. Since $x_i$ is a $p$-vector, the number of moment conditions is equal to the dimension of the parameter vector $\beta$, and thus the system is exactly identified. This is the so-called classical GMM case, when the estimator does not depend on the choice of the weighting matrix.

Note that the original strict exogeneity assumption $\mathrm{E}[\varepsilon_i \mid x_i] = 0$ implies a far richer set of moment conditions than stated above. In particular, this assumption implies that for any vector-function $f$, the moment condition $\mathrm{E}[\, f(x_i)\,\varepsilon_i\,] = 0$ will hold. However, it can be shown using the Gauss–Markov theorem that the optimal choice of function $f$ is to take $f(x) = x$, which results in the moment equation posted above.
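A sketch of the sample analogue of these moment conditions (NumPy, made-up data): the OLS residuals are exactly orthogonal to the regressors in-sample, so the empirical moments vanish at $\hat{\beta}$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -0.7, 0.3]) + rng.normal(scale=0.4, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Sample moment conditions: (1/n) * sum_i x_i * (y_i - x_i' beta_hat) = 0
sample_moments = X.T @ resid / n
print(np.allclose(sample_moments, 0.0, atol=1e-10))  # True
```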

Properties

Assumptions

There are several different frameworks in which the linear regression model can be cast in order to make the OLS technique applicable. Each of these settings produces the same formulas and same results. The only difference is the interpretation and the assumptions which have to be imposed in order for the method to give meaningful results. The choice of the applicable framework depends mostly on the nature of data in hand, and on the inference task which has to be performed.

One of the lines of difference in interpretation is whether to treat the regressors as random variables, or as predefined constants. In the first case (random design) the regressors xi are random and sampled together with the yi's from some population, as in an observational study. This approach allows for more natural study of the asymptotic properties of the estimators. In the other interpretation (fixed design), the regressors X are treated as known constants set by a design, and y is sampled conditionally on the values of X as in an experiment. For practical purposes, this distinction is often unimportant, since estimation and inference is carried out while conditioning on X. All results stated in this article are within the random design framework.

Classical linear regression model

The classical model focuses on "finite sample" estimation and inference, meaning that the number of observations $n$ is fixed. This contrasts with the other approaches, which study the asymptotic behavior of OLS as the number of observations is allowed to grow without bound.

  • Correct specification. The linear functional form must coincide with the form of the actual data-generating process.
  • Strict exogeneity. The errors in the regression should have conditional mean zero:[18] $\mathrm{E}[\varepsilon \mid X] = 0.$ The immediate consequence of the exogeneity assumption is that the errors have mean zero, $\mathrm{E}[\varepsilon] = 0$ (by the law of total expectation), and that the regressors are uncorrelated with the errors: $\mathrm{E}[X^{\mathsf{T}}\varepsilon] = 0$.
    The exogeneity assumption is critical for the OLS theory. If it holds then the regressor variables are called exogenous. If it does not, then those regressors that are correlated with the error term are called endogenous,[19] and the OLS estimator becomes biased. In such a case the method of instrumental variables may be used to carry out inference.
  • No linear dependence. The regressors in $X$ must all be linearly independent. Mathematically, this means that the matrix $X$ must have full column rank almost surely:[20] $\Pr\!\big[\operatorname{rank}(X) = p\big] = 1.$ Usually, it is also assumed that the regressors have finite moments up to at least the second moment. Then the matrix $Q_{xx} = \mathrm{E}[X^{\mathsf{T}}X / n]$ is finite and positive semi-definite.
    When this assumption is violated the regressors are called linearly dependent or perfectly multicollinear. In such a case the value of the regression coefficient $\beta$ cannot be learned, although prediction of $y$ values is still possible for new values of the regressors that lie in the same linearly dependent subspace.
  • Spherical errors:[20] $\operatorname{Var}[\varepsilon \mid X] = \sigma^2 I_n,$ where $I_n$ is the identity matrix in dimension $n$, and $\sigma^2$ is a parameter which determines the variance of each observation. This $\sigma^2$ is considered a nuisance parameter in the model, although usually it is also estimated. If this assumption is violated then the OLS estimates are still valid, but no longer efficient.
    It is customary to split this assumption into two parts:
    • Homoscedasticity: $\mathrm{E}[\,\varepsilon_i^2 \mid X\,] = \sigma^2$, which means that the error term has the same variance $\sigma^2$ in each observation. When this requirement is violated this is called heteroscedasticity; in such a case a more efficient estimator would be weighted least squares. If the errors have infinite variance then the OLS estimates will also have infinite variance (although by the law of large numbers they will nonetheless tend toward the true values so long as the errors have zero mean). In this case, robust estimation techniques are recommended.
    • No autocorrelation: the errors are uncorrelated between observations: $\mathrm{E}[\,\varepsilon_i \varepsilon_j \mid X\,] = 0$ for $i \neq j$. This assumption may be violated in the context of time series data, panel data, cluster samples, hierarchical data, repeated measures data, longitudinal data, and other data with dependencies. In such cases generalized least squares provides a better alternative than OLS. Another expression for autocorrelation is serial correlation.
  • Normality. It is sometimes additionally assumed that the errors have a normal distribution conditional on the regressors:[21] $\varepsilon \mid X \sim \mathcal{N}(0, \sigma^2 I_n).$ This assumption is not needed for the validity of the OLS method, although certain additional finite-sample properties can be established when it does hold (especially in the area of hypothesis testing). Also, when the errors are normal, the OLS estimator is equivalent to the maximum likelihood estimator (MLE), and therefore it is asymptotically efficient in the class of all regular estimators. Importantly, the normality assumption applies only to the error terms; contrary to a popular misconception, the response (dependent) variable is not required to be normally distributed.[22]

Independent and identically distributed (iid)

In some applications, especially with cross-sectional data, an additional assumption is imposed — that all observations are independent and identically distributed. This means that all observations are taken from a random sample which makes all the assumptions listed earlier simpler and easier to interpret. Also this framework allows one to state asymptotic results (as the sample size n → ∞), which are understood as a theoretical possibility of fetching new independent observations from the data generating process. The list of assumptions in this case is:

  • IID observations: (xi, yi) is independent from, and has the same distribution as, (xj, yj) for all i ≠ j;
  • No perfect multicollinearity: Qxx = E[ xi xiT ] is a positive-definite matrix;
  • Exogeneity: E[ εi | xi ] = 0;
  • Homoscedasticity: Var[ εi | xi ] = σ2.

Time series model

Finite sample properties

First of all, under the strict exogeneity assumption the OLS estimators $\hat{\beta}$ and $s^2$ are unbiased, meaning that their expected values coincide with the true values of the parameters:[24][proof]

$$\mathrm{E}[\,\hat{\beta} \mid X\,] = \beta, \qquad \mathrm{E}[\,s^2 \mid X\,] = \sigma^2.$$

If the strict exogeneity does not hold (as is the case with many time series models, where exogeneity is assumed only with respect to the past shocks but not the future ones), then these estimators will be biased in finite samples.

The variance-covariance matrix (or simply covariance matrix) of $\hat{\beta}$ is equal to[25]

$$\operatorname{Var}[\,\hat{\beta} \mid X\,] = \sigma^2 \big(X^{\mathsf{T}}X\big)^{-1}.$$

In particular, the standard error of each coefficient $\hat{\beta}_j$ is equal to the square root of the $j$-th diagonal element of this matrix. The estimate of this standard error is obtained by replacing the unknown quantity $\sigma^2$ with its estimate $s^2$. Thus,

$$\widehat{\operatorname{s.e.}}\big(\hat{\beta}_j\big) = \sqrt{\, s^2 \big[(X^{\mathsf{T}}X)^{-1}\big]_{jj} \,}.$$
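A sketch of these covariance and standard-error computations (NumPy, synthetic data invented for the example):

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 120, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(scale=0.7, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)                  # unbiased estimate of sigma^2

cov_beta = s2 * np.linalg.inv(X.T @ X)        # estimated Var[beta_hat | X]
std_errors = np.sqrt(np.diag(cov_beta))       # standard error of each coefficient

print(beta_hat, std_errors)
```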

It can also be easily shown that the estimator $\hat{\beta}$ is uncorrelated with the residuals from the model:[25]

$$\operatorname{Cov}[\,\hat{\beta},\, \hat{\varepsilon} \mid X\,] = 0.$$

The Gauss–Markov theorem states that under the spherical errors assumption (that is, the errors should be uncorrelated and homoscedastic) the estimator $\hat{\beta}$ is efficient in the class of linear unbiased estimators. This is called the best linear unbiased estimator (BLUE). Efficiency should be understood as follows: if we were to find some other estimator $\tilde{\beta}$ which is linear in $y$ and unbiased, then[25]

$$\operatorname{Var}[\,\tilde{\beta} \mid X\,] - \operatorname{Var}[\,\hat{\beta} \mid X\,] \geq 0$$

in the sense that this difference is a nonnegative-definite matrix. This theorem establishes optimality only in the class of linear unbiased estimators, which is quite restrictive. Depending on the distribution of the error terms $\varepsilon$, other, non-linear estimators may provide better results than OLS.

Assuming normality

The properties listed so far are all valid regardless of the underlying distribution of the error terms. However, if one is willing to assume that the normality assumption holds (that is, that $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$), then additional properties of the OLS estimators can be stated.

The estimator $\hat{\beta}$ is normally distributed, with mean and variance as given before:[26]

$$\hat{\beta} \ \sim\ \mathcal{N}\!\big(\beta,\ \sigma^2 (X^{\mathsf{T}}X)^{-1}\big).$$

This estimator reaches the Cramér–Rao bound for the model, and thus is optimal in the class of all unbiased estimators.[17] Note that unlike the Gauss–Markov theorem, this result establishes optimality among both linear and non-linear estimators, but only in the case of normally distributed error terms.

The estimator $s^2$ will be proportional to the chi-squared distribution:[27]

$$s^2 \ \sim\ \frac{\sigma^2}{n - p} \cdot \chi^2_{\,n - p}.$$

The variance of this estimator is equal to $2\sigma^4/(n - p)$, which does not attain the Cramér–Rao bound of $2\sigma^4/n$. However, it was shown that there are no unbiased estimators of $\sigma^2$ with variance smaller than that of the estimator $s^2$.[28] If we are willing to allow biased estimators, and consider the class of estimators that are proportional to the sum of squared residuals (SSR) of the model, then the best (in the sense of the mean squared error) estimator in this class will be $\tilde{\sigma}^2 = \mathrm{SSR}/(n - p + 2)$, which even beats the Cramér–Rao bound in the case when there is only one regressor ($p = 1$).[29]

Moreover, the estimators $\hat{\beta}$ and $s^2$ are independent,[30] a fact which comes in useful when constructing the t- and F-tests for the regression.

Influential observations

As was mentioned before, the estimator is linear in y, meaning that it represents a linear combination of the dependent variables yi. The weights in this linear combination are functions of the regressors X, and generally are unequal. The observations with high weights are called influential because they have a more pronounced effect on the value of the estimator.

To analyze which observations are influential we remove a specific $j$-th observation and consider how much the estimated quantities are going to change (similarly to the jackknife method). It can be shown that the change in the OLS estimator for $\beta$ will be equal to[31]

$$\hat{\beta}^{(j)} - \hat{\beta} = -\frac{1}{1 - h_j}\, (X^{\mathsf{T}}X)^{-1} x_j\, \hat{\varepsilon}_j,$$

where $h_j = x_j^{\mathsf{T}} (X^{\mathsf{T}}X)^{-1} x_j$ is the $j$-th diagonal element of the hat matrix $P$, and $x_j$ is the vector of regressors corresponding to the $j$-th observation. Similarly, the change in the predicted value for the $j$-th observation resulting from omitting that observation from the dataset will be equal to[31]

$$\hat{y}_j^{(j)} - \hat{y}_j = x_j^{\mathsf{T}}\big(\hat{\beta}^{(j)} - \hat{\beta}\big) = -\frac{h_j}{1 - h_j}\, \hat{\varepsilon}_j.$$

From the properties of the hat matrix, $0 \leq h_j \leq 1$, and they sum up to $p$, so that on average $h_j \approx p/n$. These quantities $h_j$ are called the leverages, and observations with high $h_j$ are called leverage points.[32] Usually the observations with high leverage ought to be scrutinized more carefully, in case they are erroneous, or outliers, or in some other way atypical of the rest of the dataset.
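A sketch of computing leverages from the hat-matrix diagonal (NumPy, synthetic data invented for the example, with one deliberately extreme regressor value planted):

```python
import numpy as np

rng = np.random.default_rng(13)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
X[0, 1] = 8.0                      # plant one unusually extreme regressor value

P = X @ np.linalg.inv(X.T @ X) @ X.T
leverages = np.diag(P)             # h_j for each observation

print(np.isclose(leverages.sum(), p))          # leverages sum to p
# Index of the highest-leverage point (likely the planted one) and its value:
print(np.argmax(leverages), leverages.max())
```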

Partitioned regression

Sometimes the variables and corresponding parameters in the regression can be logically split into two groups, so that the regression takes the form

$$y = X_1\beta_1 + X_2\beta_2 + \varepsilon,$$

where $X_1$ and $X_2$ have dimensions $n \times p_1$ and $n \times p_2$, and $\beta_1$, $\beta_2$ are $p_1 \times 1$ and $p_2 \times 1$ vectors, with $p_1 + p_2 = p$.

The Frisch–Waugh–Lovell theorem states that in this regression the residuals $\hat{\varepsilon}$ and the OLS estimate $\hat{\beta}_2$ will be numerically identical to the residuals and the OLS estimate for $\beta_2$ in the following regression:[33]

$$M_1 y = M_1 X_2 \beta_2 + \eta,$$

where $M_1$ is the annihilator matrix for the regressors $X_1$.

The theorem can be used to establish a number of theoretical results. For example, having a regression with a constant and another regressor is equivalent to subtracting the means from the dependent variable and the regressor and then running the regression for the de-meaned variables but without the constant term.

Constrained estimation

Suppose it is known that the coefficients in the regression satisfy a system of linear equations

$$Q^{\mathsf{T}} \beta = c,$$

where $Q$ is a $p \times q$ matrix of full rank, and $c$ is a $q \times 1$ vector of known constants, where $q < p$. In this case least squares estimation is equivalent to minimizing the sum of squared residuals of the model subject to the constraint $Q^{\mathsf{T}}\beta = c$. The constrained least squares (CLS) estimator can be given by an explicit formula:[34]

$$\hat{\beta}^{c} = \hat{\beta} - (X^{\mathsf{T}}X)^{-1} Q \big( Q^{\mathsf{T}} (X^{\mathsf{T}}X)^{-1} Q \big)^{-1} \big( Q^{\mathsf{T}} \hat{\beta} - c \big).$$

This expression for the constrained estimator is valid as long as the matrix $X^{\mathsf{T}}X$ is invertible. It was assumed from the beginning of this article that this matrix is of full rank, and it was noted that when the rank condition fails, $\beta$ will not be identifiable. However, it may happen that adding the restriction $Q^{\mathsf{T}}\beta = c$ makes $\beta$ identifiable, in which case one would like to find the formula for the estimator. The estimator is equal to[35]

where $R$ is a $p \times (p - q)$ matrix such that the matrix $[Q\ R]$ is non-singular, and $R^{\mathsf{T}} Q = 0$. Such a matrix can always be found, although generally it is not unique. The second formula coincides with the first in the case when $X^{\mathsf{T}}X$ is invertible.[35]
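A sketch of the constrained (restricted) least squares formula above (NumPy; the restriction and data are invented for the example), checking that the constrained estimate satisfies the restriction exactly:

```python
import numpy as np

rng = np.random.default_rng(17)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.6, 0.4]) + rng.normal(scale=0.5, size=n)

# Example restriction Q' beta = c: force beta_1 + beta_2 = 1.
Q = np.array([[0.0], [1.0], [1.0]])   # p x q with q = 1
c = np.array([1.0])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Constrained least squares estimator.
adjust = XtX_inv @ Q @ np.linalg.inv(Q.T @ XtX_inv @ Q) @ (Q.T @ beta_hat - c)
beta_cls = beta_hat - adjust

print(np.allclose(Q.T @ beta_cls, c))  # constraint holds exactly
```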

Large sample properties

The least squares estimators are point estimates of the linear regression model parameters β. However, generally we also want to know how close those estimates might be to the true values of parameters. In other words, we want to construct the interval estimates.

Since we have not made any assumption about the distribution of the error term $\varepsilon_i$, it is impossible to infer the exact distribution of the estimators $\hat{\beta}$ and $\hat{\sigma}^2$. Nevertheless, we can apply the central limit theorem to derive their asymptotic properties as the sample size $n$ goes to infinity. While the sample size is necessarily finite, it is customary to assume that $n$ is "large enough" so that the true distribution of the OLS estimator is close to its asymptotic limit.

We can show that under the model assumptions, the least squares estimator for $\beta$ is consistent (that is, $\hat{\beta}$ converges in probability to $\beta$) and asymptotically normal:[proof]

$$\sqrt{n}\,\big(\hat{\beta} - \beta\big)\ \xrightarrow{d}\ \mathcal{N}\!\big(0,\ \sigma^2 Q_{xx}^{-1}\big),$$

where $Q_{xx} = X^{\mathsf{T}} X / n \to \mathrm{E}[\,x_i x_i^{\mathsf{T}}\,]$.

Intervals

Using this asymptotic distribution, approximate two-sided confidence intervals for the $j$-th component of the vector $\hat{\beta}$ can be constructed as

$$\beta_j \in \Big[\ \hat{\beta}_j \pm q^{\mathcal{N}(0,1)}_{1 - \alpha/2}\, \sqrt{\, s^2 \big[(X^{\mathsf{T}}X)^{-1}\big]_{jj} \,}\ \Big] \qquad \text{at the } 1 - \alpha \text{ confidence level,}$$

where $q$ denotes the quantile function of the standard normal distribution, and $[\cdot]_{jj}$ is the $j$-th diagonal element of a matrix.
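A sketch of such an interval computation (NumPy and SciPy, with made-up data and a 95% confidence level):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(21)
n, p = 150, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.5]) + rng.normal(scale=1.0, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

alpha = 0.05
z = norm.ppf(1 - alpha / 2)              # ~1.96 for a 95% interval
lower, upper = beta_hat - z * se, beta_hat + z * se
print(np.column_stack([lower, beta_hat, upper]))
```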

Similarly, the least squares estimator for $\sigma^2$ is also consistent and asymptotically normal (provided that the fourth moment of $\varepsilon_i$ exists) with limiting distribution

$$\sqrt{n}\,\big(\hat{\sigma}^2 - \sigma^2\big)\ \xrightarrow{d}\ \mathcal{N}\!\big(0,\ \mathrm{E}[\varepsilon_i^4] - \sigma^4\big).$$

These asymptotic distributions can be used for prediction, testing hypotheses, constructing other estimators, etc. As an example consider the problem of prediction. Suppose $x_0$ is some point within the domain of the distribution of the regressors, and one wants to know what the response variable would have been at that point. The mean response is the quantity $y_0 = x_0^{\mathsf{T}}\beta$, whereas the predicted response is $\hat{y}_0 = x_0^{\mathsf{T}}\hat{\beta}$. Clearly the predicted response is a random variable; its distribution can be derived from that of $\hat{\beta}$:

$$\sqrt{n}\,\big(\hat{y}_0 - y_0\big)\ \xrightarrow{d}\ \mathcal{N}\!\big(0,\ \sigma^2\, x_0^{\mathsf{T}} Q_{xx}^{-1} x_0\big),$$

which allows confidence intervals for the mean response $y_0$ to be constructed:

$$y_0 \in \Big[\ x_0^{\mathsf{T}}\hat{\beta} \pm q^{\mathcal{N}(0,1)}_{1 - \alpha/2}\, \sqrt{\, s^2\, x_0^{\mathsf{T}} (X^{\mathsf{T}}X)^{-1} x_0 \,}\ \Big] \qquad \text{at the } 1 - \alpha \text{ confidence level.}$$

Hypothesis testing

Two hypothesis tests are particularly widely used. First, one wants to know if the estimated regression equation is any better than simply predicting that all values of the response variable equal its sample mean (if not, it is said to have no explanatory power). The null hypothesis of no explanatory value of the estimated regression is tested using an F-test. If the calculated F-value is found to be large enough to exceed its critical value for the pre-chosen level of significance, the null hypothesis is rejected and the alternative hypothesis, that the regression has explanatory power, is accepted. Otherwise, the null hypothesis of no explanatory power is accepted.

Second, for each explanatory variable of interest, one wants to know whether its estimated coefficient differs significantly from zero—that is, whether this particular explanatory variable in fact has explanatory power in predicting the response variable. Here the null hypothesis is that the true coefficient is zero. This hypothesis is tested by computing the coefficient's t-statistic, as the ratio of the coefficient estimate to its standard error. If the t-statistic is larger than a predetermined value, the null hypothesis is rejected and the variable is found to have explanatory power, with its coefficient significantly different from zero. Otherwise, the null hypothesis of a zero value of the true coefficient is accepted.

In addition, the Chow test is used to test whether two subsamples both have the same underlying true coefficient values. The sum of squared residuals of regressions on each of the subsets and on the combined data set are compared by computing an F-statistic; if this exceeds a critical value, the null hypothesis of no difference between the two subsets is rejected; otherwise, it is accepted.

Example with real data

The following data set gives average heights and weights for American women aged 30–39 (source: The World Almanac and Book of Facts, 1975).

Scatterplot of the data; the relationship is slightly curved but close to linear.

Height (m)  1.47  1.50  1.52  1.55  1.57  1.60  1.63  1.65  1.68  1.70  1.73  1.75  1.78  1.80  1.83
Weight (kg) 52.21 53.12 54.48 55.84 57.20 58.57 59.93 61.29 63.11 64.47 66.28 68.10 69.92 72.19 74.46

When only one dependent variable is being modeled, a scatterplot will suggest the form and strength of the relationship between the dependent variable and regressors. It might also reveal outliers, heteroscedasticity, and other aspects of the data that may complicate the interpretation of a fitted regression model. The scatterplot suggests that the relationship is strong and can be approximated as a quadratic function. OLS can handle non-linear relationships by introducing the regressor HEIGHT2. The regression model then becomes a multiple linear model:

$$w_i = \beta_1 + \beta_2 h_i + \beta_3 h_i^2 + \varepsilon_i.$$

Fitted regression: the estimated relationship is $\hat{w} = 128.8128 - 143.162\, h + 61.96\, h^2$ (see the output table below).

The output from most popular statistical packages will look similar to this:

Method: Least squares
Dependent variable: WEIGHT
Observations: 15

Parameter  Value      Std error  t-statistic  p-value
Const      128.8128   16.3083    7.8986       0.0000
Height     −143.1620  19.8332    −7.2183      0.0000
Height²    61.9603    6.0084     10.3122      0.0000

R²                   0.9989    S.E. of regression   0.2516
Adjusted R²          0.9987    Model sum-of-sq.     692.61
Log-likelihood       1.0890    Residual sum-of-sq.  0.7595
Durbin–Watson stat.  2.1013    Total sum-of-sq.     693.37
Akaike criterion     0.2548    F-statistic          5471.2
Schwarz criterion    0.3964    p-value (F-stat)     0.0000
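As a sketch, the quadratic fit can be reproduced from the tabulated data with a short script (NumPy); the estimated coefficients and R² should come out close to the values reported above.

```python
import numpy as np

height = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                   1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

# Design matrix: constant, HEIGHT, HEIGHT^2
X = np.column_stack([np.ones_like(height), height, height**2])
beta_hat, *_ = np.linalg.lstsq(X, weight, rcond=None)

resid = weight - X @ beta_hat
r2 = 1.0 - resid @ resid / np.sum((weight - weight.mean()) ** 2)

print(beta_hat)  # expected to be near [128.81, -143.16, 61.96]
print(r2)        # expected to be near 0.9989
```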

In this table:

  • The Value column gives the least squares estimates of the parameters $\hat{\beta}_j$.
  • The Std error column shows the standard errors of each coefficient estimate: $\hat{\sigma}_j = \sqrt{\, s^2 \big[(X^{\mathsf{T}}X)^{-1}\big]_{jj} \,}$.
  • The t-statistic and p-value columns test whether any of the coefficients might be equal to zero. The t-statistic is calculated simply as $t = \hat{\beta}_j / \hat{\sigma}_j$. If the errors ε follow a normal distribution, t follows a Student-t distribution. Under weaker conditions, t is asymptotically normal. Large values of t indicate that the null hypothesis can be rejected and that the corresponding coefficient is not zero. The second column, p-value, expresses the results of the hypothesis test as a significance level. Conventionally, p-values smaller than 0.05 are taken as evidence that the population coefficient is nonzero (a sketch of this computation follows this list).
  • R-squared is the coefficient of determination indicating goodness-of-fit of the regression. This statistic will be equal to one if the fit is perfect, and to zero when the regressors X have no explanatory power whatsoever. This is a biased estimate of the population R-squared, and will never decrease if additional regressors are added, even if they are irrelevant.
  • Adjusted R-squared is a slightly modified version of $R^2$, designed to penalize for the excess number of regressors which do not add to the explanatory power of the regression. This statistic is always smaller than $R^2$, can decrease as new regressors are added, and even be negative for poorly fitting models: $\overline{R}^2 = 1 - \frac{n - 1}{n - p}\,(1 - R^2)$.
  • Log-likelihood is calculated under the assumption that the errors follow a normal distribution. Even though the assumption is not very reasonable, this statistic may still find its use in conducting LR tests.
  • The Durbin–Watson statistic tests whether there is any evidence of serial correlation between the residuals. As a rule of thumb, a value smaller than 2 is evidence of positive correlation.
  • The Akaike information criterion and Schwarz criterion are both used for model selection. Generally, when comparing two alternative models, smaller values of one of these criteria will indicate a better model.[36]
  • Standard error of regression is an estimate of σ, the standard error of the error term.
  • Total sum of squares, model sum of squares, and residual sum of squares tell us how much of the initial variation in the sample was explained by the regression.
  • The F-statistic tests the hypothesis that all coefficients (except the intercept) are equal to zero. This statistic has an F(p−1, n−p) distribution under the null hypothesis and normality assumption, and its p-value is the probability of observing an F-statistic at least this large if the null hypothesis were true. Note that when errors are not normal this statistic becomes invalid, and other tests such as the Wald test or the LR test should be used.
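A minimal sketch of the standard errors, t-statistics, and p-values for the height–weight fit (NumPy and SciPy; the Student-t reference distribution with n − p degrees of freedom is the standard choice under normal errors):

```python
import numpy as np
from scipy.stats import t as student_t

height = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                   1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

X = np.column_stack([np.ones_like(height), height, height**2])
n, p = X.shape

beta_hat, *_ = np.linalg.lstsq(X, weight, rcond=None)
resid = weight - X @ beta_hat
s2 = resid @ resid / (n - p)

se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))   # standard errors
t_stats = beta_hat / se                              # t-statistics
p_values = 2 * student_t.sf(np.abs(t_stats), df=n - p)

for b, s, tv, pv in zip(beta_hat, se, t_stats, p_values):
    print(f"{b:10.4f} {s:9.4f} {tv:9.4f} {pv:8.4f}")
```
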
Residuals plot

Ordinary least squares analysis often includes the use of diagnostic plots designed to detect departures of the data from the assumed form of the model. These are some of the common diagnostic plots:

  • Residuals against the explanatory variables in the model. A non-linear relation between these variables suggests that the linearity of the conditional mean function may not hold. Different levels of variability in the residuals for different levels of the explanatory variables suggests possible heteroscedasticity.
  • Residuals against explanatory variables not in the model. Any relation of the residuals to these variables would suggest considering these variables for inclusion in the model.
  • Residuals against the fitted values, $\hat{y}$.
  • Residuals against the preceding residual. This plot may identify serial correlations in the residuals.

An important consideration when carrying out statistical inference using regression models is how the data were sampled. In this example, the data are averages rather than measurements on individual women. The fit of the model is very good, but this does not imply that the weight of an individual woman can be predicted with high accuracy based only on her height.

Sensitivity to rounding

This example also demonstrates that coefficients determined by these calculations are sensitive to how the data is prepared. The heights were originally given rounded to the nearest inch and have been converted and rounded to the nearest centimetre. Since the conversion factor is one inch to 2.54 cm this is not an exact conversion. The original inches can be recovered by Round(x/0.0254) and then re-converted to metric without rounding. If this is done the results become:

Const Height Height2
Converted to metric with rounding. 128.8128 −143.162 61.96033
Converted to metric without rounding. 119.0205 −131.5076 58.5046
Residuals to a quadratic fit for correctly and incorrectly converted data.

Using either of these equations to predict the weight of a 5' 6" (1.6764 m) woman gives similar values: 62.94 kg with rounding vs. 62.98 kg without rounding. Thus a seemingly small variation in the data has a real effect on the coefficients but a small effect on the results of the equation.

While this may look innocuous in the middle of the data range it could become significant at the extremes or in the case where the fitted model is used to project outside the data range (extrapolation).

This highlights a common error: this example is an abuse of OLS, which inherently requires that the errors in the independent variable (in this case height) are zero or at least negligible. The initial rounding to the nearest inch plus any actual measurement errors constitute a finite and non-negligible error. As a result, the fitted parameters are not the best estimates they are presumed to be. Though not totally spurious, the error in the estimation will depend upon the relative size of the x and y errors.

Another example with less real data

Problem statement

We can use the least squares mechanism to figure out the equation of a two-body orbit in polar base coordinates. The equation typically used is

$$r(\theta) = \frac{p}{1 - e\cos\theta},$$

where $r(\theta)$ is the radius of how far the object is from one of the bodies. In the equation the parameters $p$ and $e$ are used to determine the path of the orbit. We have measured the following data:

θ (in degrees)  43      45      52      93      108     116
r               4.7126  4.5542  4.0419  2.2187  1.8910  1.7599

We need to find the least-squares approximation of $e$ and $p$ for the given data.

Solution

First we need to represent $e$ and $p$ in a linear form, so we rewrite the equation as

$$\frac{1}{r(\theta)} = \frac{1}{p} - \frac{e}{p}\cos\theta.$$

Furthermore, one could also fit for the apsides by expanding the cosine term with an extra phase parameter, which keeps the model linear in the parameters with an additional basis function; here we use the original two-parameter form to represent our observational data as

$$A x = b,$$

where $x$ contains the unknowns $\tfrac{1}{p}$ and $\tfrac{e}{p}$; $b$ is the vector of observed values $\tfrac{1}{r(\theta_i)}$; and $A$ contains the coefficients of $\tfrac{1}{p}$ in the first column, which are all 1, and the coefficients of $\tfrac{e}{p}$ in the second column, given by $-\cos\theta_i$, such that the least-squares solution solves the normal equations $A^{\mathsf{T}}A\,\hat{x} = A^{\mathsf{T}}b$.

On solving the normal equations we obtain the fitted values of $\tfrac{1}{p}$ and $\tfrac{e}{p}$, from which the least-squares estimates of $e$ and $p$ follow.
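A sketch of this fit (NumPy), assuming the orbit equation in the form given above; the recovered $e$ and $p$ are whatever the least-squares solution of the linearized system yields for these six measurements:

```python
import numpy as np

theta_deg = np.array([43, 45, 52, 93, 108, 116])
r = np.array([4.7126, 4.5542, 4.0419, 2.2187, 1.8910, 1.7599])

theta = np.deg2rad(theta_deg)

# Linearized model: 1/r = (1/p) - (e/p) * cos(theta)
A = np.column_stack([np.ones_like(theta), -np.cos(theta)])
b = 1.0 / r

(x1, x2), *_ = np.linalg.lstsq(A, b, rcond=None)   # x1 = 1/p, x2 = e/p
p_hat = 1.0 / x1
e_hat = x2 * p_hat

print(e_hat, p_hat)
```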

from Grokipedia
Ordinary least squares (OLS) is a statistical method for estimating the parameters of a model by minimizing the sum of the squared residuals between observed and predicted values. Developed independently by Legendre in 1805 and Gauss (who claimed prior use from 1795 and published in 1809), OLS forms the foundation of regression analysis and is widely applied in fields such as economics, engineering, and the social sciences for modeling relationships between variables. Under the Gauss-Markov theorem, OLS produces the best linear unbiased estimator (BLUE) when certain assumptions hold, including linearity in parameters, independence of errors, homoscedasticity (constant variance of errors), no perfect multicollinearity among predictors, and zero mean errors. The method assumes the model is linear in its parameters (meaning the response variable is a linear combination of explanatory variables plus an error term), though the relationship need not be a straight line if transformations are applied. For inference, such as constructing confidence intervals, an additional assumption of normally distributed errors is often invoked, though it is not strictly necessary for large samples due to the central limit theorem. OLS is computationally straightforward, typically solved via normal equations derived from setting the partial derivatives of the sum of squared residuals to zero, yielding closed-form solutions like the slope estimate $b = \frac{\sum (Y_i - \bar{Y})(X_i - \bar{X})}{\sum (X_i - \bar{X})^2}$ for simple linear regression. It is sensitive to outliers and violations of assumptions like heteroscedasticity, which can lead to inefficient or biased estimates, prompting alternatives such as weighted least squares or robust regression in such cases. Despite these limitations, OLS remains a cornerstone of statistical modeling due to its interpretability, efficiency with small datasets, and ability to provide prediction intervals when assumptions are met. The technique's historical roots in astronomy and geodesy underscore its enduring role in handling observational errors.

Model Formulation

Scalar Form

The ordinary least squares (OLS) method begins with the formulation of a linear regression model in scalar notation, which expresses the relationship between a dependent variable and one or more independent variables for each observation. In the simplest case of univariate simple linear regression, the model for a single observation is given by $Y = \beta_0 + \beta_1 X + \varepsilon$, where $Y$ is the response variable, $X$ is the predictor variable, $\beta_0$ is the intercept representing the expected value of $Y$ when $X = 0$, $\beta_1$ is the slope indicating the change in $Y$ for a one-unit increase in $X$, and $\varepsilon$ is the random error term capturing unexplained variation. For the more general multivariate case with $n$ observations and $k$ predictors, the model is specified as $Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_k X_{ik} + \varepsilon_i$, $i = 1, \dots, n$, where $Y_i$ is the $i$-th observed response, $X_{ij}$ is the value of the $j$-th predictor for the $i$-th observation, $\beta_j$ (for $j = 1, \dots, k$) are the parameters measuring the effect of each predictor on $Y_i$ holding others constant, and $\beta_0$ is the intercept. The error terms $\varepsilon_i$ are assumed to have mean zero, $E(\varepsilon_i) = 0$, and constant variance, $\mathrm{Var}(\varepsilon_i) = \sigma^2$, ensuring the model's linearity in parameters and homoscedasticity. This scalar form provides an intuitive, element-wise representation of the linear model, originating from early applications in astronomy by Legendre in 1805 and Gauss in 1809, who developed the approach to fit orbits to astronomical data amid measurement errors. The notation can be extended compactly to vector and matrix forms for multivariate derivations.

Vector and Matrix Form

The vector and matrix formulation of ordinary least squares (OLS) extends the scalar representation of the model to handle multiple observations and predictors simultaneously, leveraging linear algebra for compact notation and computational efficiency. In this framework, the dependent variable is expressed as an $n \times 1$ column vector $\mathbf{Y}$, where $n$ denotes the number of observations, compiling all response values $y_i$ for $i = 1, \dots, n$. The parameter vector is a $(k+1) \times 1$ column vector $\boldsymbol{\beta}$, encompassing the intercept $\beta_0$ and $k$ coefficients $\beta_1, \dots, \beta_k$. The error term becomes an $n \times 1$ vector $\boldsymbol{\varepsilon}$, capturing the deviations for each observation. The core model equation in matrix form is $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\mathbf{X}$ is the $n \times (k+1)$ design matrix that organizes the predictor data. This equation generalizes the scalar form $y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + \varepsilon_i$ across all $i$, enabling matrix operations for analysis. The design matrix $\mathbf{X}$ plays a pivotal role by structuring the input features, with its first column consisting of ones to accommodate the intercept term, followed by the $n$ rows of the $k$ predictor variables $x_{ij}$. For numerical stability and interpretability, the columns of $\mathbf{X}$ (excluding the intercept) are often centered by subtracting their means or scaled by dividing by their standard deviations, which does not alter the fitted model but mitigates issues like multicollinearity or ill-conditioning in computations. Partitioning $\mathbf{X}$ further distinguishes the intercept column from the predictor columns, such as $\mathbf{X} = [\mathbf{1} \mid \mathbf{Z}]$, where $\mathbf{1}$ is the $n \times 1$ vector of ones and $\mathbf{Z}$ is the $n \times k$ matrix of centered or scaled predictors; this separation facilitates modular analysis of model components.

Estimation Methods

Least Squares Objective

The ordinary least squares (OLS) objective seeks to estimate the parameters of a model by minimizing the sum of squared residuals, which measures the discrepancy between observed and predicted values. In scalar form, for a model $Y_i = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij} + \epsilon_i$ with $i = 1, \dots, n$, the objective function is $S(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ij} \right)^2$, where $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_p)^\top$. Equivalently, in matrix notation, let $\mathbf{Y}$ be the $n \times 1$ vector of responses, $\mathbf{X}$ the $n \times (p+1)$ design matrix (including a column of ones for the intercept), and $\boldsymbol{\beta}$ the $(p+1) \times 1$ parameter vector; the objective then becomes minimizing $\| \mathbf{Y} - \mathbf{X}\boldsymbol{\beta} \|_2^2$, the squared Euclidean norm of the residual vector. Residuals are defined as the differences $e_i = Y_i - \hat{Y}_i$, where $\hat{Y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij}$ represents the fitted value for the $i$-th observation under the estimated parameters. Squaring these residuals in the objective function penalizes larger deviations more severely than smaller ones, which emphasizes fitting accuracy for outliers, while also preventing positive and negative errors from canceling each other out in the sum. This further ensures a smooth, convex objective amenable to optimization techniques. Under the classical linear model assumptions (linearity, strict exogeneity, homoskedasticity, and no perfect multicollinearity), minimizing this objective yields the best linear unbiased estimator (BLUE) of $\boldsymbol{\beta}$, meaning it has the minimum variance among all linear unbiased estimators, as established by the Gauss-Markov theorem. Geometrically, the objective corresponds to finding the point in the column space of $\mathbf{X}$ closest to $\mathbf{Y}$ in Euclidean distance, equivalent to the squared length of the perpendicular from $\mathbf{Y}$ to that subspace.

Closed-Form Estimator

The closed-form estimator for the coefficients in ordinary least squares (OLS) regression provides an explicit algebraic solution to the least squares objective, applicable when the design matrix satisfies certain conditions. In the vector and matrix formulation, the normal equations that define the OLS estimator are $\mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^\top \mathbf{y}$, where $\mathbf{X}$ is the $n \times (p+1)$ design matrix (including the intercept column of ones), $\boldsymbol{\beta}$ is the $(p+1) \times 1$ vector of coefficients, and $\mathbf{y}$ is the $n \times 1$ response vector; these equations hold assuming $\mathbf{X}^\top \mathbf{X}$ is invertible, which requires $\mathbf{X}$ to have full column rank. Solving the normal equations yields the closed-form estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$. For the case with a single predictor, the slope estimator simplifies to $\hat{\beta}_1 = \frac{\text{Cov}(X, Y)}{\text{Var}(X)}$, and the intercept to $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$, where $\bar{X}$ and $\bar{Y}$ denote sample means. In practice, direct computation of $(\mathbf{X}^\top \mathbf{X})^{-1}$ can suffer from numerical instability if $\mathbf{X}^\top \mathbf{X}$ is ill-conditioned due to multicollinearity or scaling issues; instead, QR decomposition of $\mathbf{X} = \mathbf{QR}$ allows solving $\mathbf{R}\hat{\boldsymbol{\beta}} = \mathbf{Q}^\top \mathbf{y}$ for improved stability, while singular value decomposition (SVD) of $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ handles rank-deficient cases by effectively computing a pseudoinverse.
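A sketch comparing these computational routes (NumPy, with made-up data): the explicit normal-equation solution, a QR-based solve, and the SVD-based least-squares solver all agree on a well-conditioned problem.

```python
import numpy as np

rng = np.random.default_rng(23)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(scale=0.3, size=n)

# 1) Normal equations (fine here, but least numerically stable in general)
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# 2) QR decomposition: solve R beta = Q' y
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# 3) SVD-based solver (handles rank-deficient X via a pseudoinverse)
beta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_qr), np.allclose(beta_qr, beta_svd))
```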

Derivations of Estimator

Geometric Projection

The ordinary least squares (OLS) estimator admits a natural derivation through the lens of vector geometry in Euclidean space. Consider the response vector $\mathbf{Y} \in \mathbb{R}^n$, where $n$ denotes the number of observations, and the design matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ with $p$ predictors ($p \leq n$). The columns of $\mathbf{X}$ span a $p$-dimensional subspace $\mathcal{C}(\mathbf{X}) \subseteq \mathbb{R}^n$. The OLS estimate $\hat{\beta}$ selects the vector in this subspace closest to $\mathbf{Y}$ in the Euclidean norm, such that the fitted values $\hat{\mathbf{Y}} = \mathbf{X}\hat{\beta}$ form the orthogonal projection of $\mathbf{Y}$ onto $\mathcal{C}(\mathbf{X})$. This projection minimizes the squared distance $\|\mathbf{Y} - \mathbf{X}\beta\|_2^2$ over all $\beta \in \mathbb{R}^p$. The orthogonality of the projection implies that the residual vector $\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}}$ is orthogonal to $\mathcal{C}(\mathbf{X})$, satisfying $\mathbf{X}^T \mathbf{e} = \mathbf{0}$. Substituting the residual expression yields the normal equations $\mathbf{X}^T(\mathbf{Y} - \mathbf{X}\hat{\beta}) = \mathbf{0}$, which uniquely determine $\hat{\beta}$ when $\mathbf{X}^T\mathbf{X}$ is invertible (i.e., $\mathbf{X}$ has full column rank). This condition ensures the residuals lie in the orthogonal complement of the column space, geometrically partitioning $\mathbf{Y}$ into its projection onto $\mathcal{C}(\mathbf{X})$ and the orthogonal remainder. The projection operator is formalized by the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$, a symmetric and idempotent matrix ($\mathbf{H}^T = \mathbf{H}$ and $\mathbf{H}^2 = \mathbf{H}$) that maps any vector in $\mathbb{R}^n$ onto $\mathcal{C}(\mathbf{X})$. Thus, the fitted values are $\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}$, and the residuals are $\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{Y}$, where $\mathbf{I}$ is the identity matrix. This matrix representation underscores how OLS geometrically "hats" the observed $\mathbf{Y}$ by projecting it onto the subspace defined by the predictors. To illustrate in the context of simple linear regression (a single predictor plus an intercept), visualize the data as points $(x_i, y_i)$ in a 2D scatterplot. The column space $\mathcal{C}(\mathbf{X})$ is spanned by the vector of ones and the predictor vector $\mathbf{x} = (x_1, \dots, x_n)^T$. The OLS fitted line $\hat{y} = \hat{\alpha} + \hat{\beta}x$ minimizes the sum of squared vertical distances from the points to the line, corresponding to the orthogonal projection of $\mathbf{Y}$ onto this 2D subspace in $\mathbb{R}^n$. Geometrically, the residuals appear as vertical segments in the plot, but their orthogonality condition ensures $\sum e_i = 0$ and $\sum x_i e_i = 0$, aligning the line through the data centroid while remaining perpendicular to the subspace in the full observation space.

Maximum Likelihood Approach

Under the assumption that the errors in the model are independent and identically distributed as normal random variables, the ordinary least squares (OLS) estimator can be derived as the maximum likelihood estimator (MLE) of the model parameters. Consider the classical Y=Xβ+ϵY = X\beta + \epsilon, where YY is an n×1n \times 1 vector of observations, XX is an n×pn \times p , β\beta is a p×1p \times 1 vector of unknown parameters, and ϵ\epsilon is an n×1n \times 1 error vector with ϵiN(0,σ2)\epsilon_i \sim N(0, \sigma^2) i.i.d. for i=1,,ni = 1, \dots, n. The likelihood function for the parameters β\beta and σ2\sigma^2 given the data YY and XX is L(β,σ2Y,X)=(2πσ2)n/2exp(12σ2YXβ2),L(\beta, \sigma^2 \mid Y, X) = (2\pi \sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \|Y - X\beta\|^2 \right), which is proportional to (σ2)n/2exp(12σ2YXβ2)(\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \|Y - X\beta\|^2 \right). To obtain the MLE, maximize the log-likelihood \ell(\beta, \sigma^2 \mid Y, X) = -\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \|Y - X\beta\|^2. $$ Maximizing this with respect to $ \beta $ for fixed $ \sigma^2 $ (or jointly) leads to the normal equations $ X^T X \beta = X^T Y $, whose solution is the OLS estimator $ \hat{\beta} = (X^T X)^{-1} X^T Y $.[](https://data.princeton.edu/wws509/notes/c2s2.html) This equivalence holds because the least squares objective directly corresponds to the exponential term in the likelihood under the normality assumption.[](https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/06/lecture-06.pdf) The derivation requires the errors to be i.i.d. normal with mean zero and constant variance $ \sigma^2 $, which justifies the probabilistic interpretation of OLS as MLE.[](http://web.stanford.edu/~rjohari/teaching/notes/226_lecture11_inference.pdf) In this setting, the OLS estimator achieves full efficiency, meaning it has the smallest possible variance among all unbiased estimators, as per the Cramér-Rao lower bound.[](https://math.arizona.edu/~jwatkins/o-mle.pdf) Without normality, the Gauss-Markov theorem still establishes OLS as the best linear unbiased estimator (BLUE) with minimal variance among linear unbiased estimators, but the MLE framework provides broader efficiency guarantees only when normality holds.[](http://assets.press.princeton.edu/chapters/s6946.pdf) ### Method of Moments The method of moments derives the ordinary least squares (OLS) estimator by equating [population](/page/Population) moments implied by the [linear regression](/page/Linear_regression) model to their sample counterparts. In the [population](/page/Population), the model posits that the [conditional expectation](/page/Conditional_expectation) of the dependent variable $ Y $ given the regressors $ X $ is $ \mathbb{E}(Y \mid X) = X \beta $, where $ \beta $ is the vector of unknown parameters. This implies that the error term $ u = Y - X \beta $ satisfies $ \mathbb{E}(u \mid X) = 0 $, or equivalently, the unconditional moments $ \mathbb{E}(u) = 0 $ and $ \mathbb{E}(X' u) = 0 $. These moment conditions form the foundation for [estimation](/page/Estimation), as they express the [orthogonality](/page/Orthogonality) between the regressors and the errors.[](https://warwick.ac.uk/fac/soc/economics/staff/swarulampalam/panel/gmm.pdf) To obtain the sample estimator, the method of moments replaces the population expectations with sample averages. 
For a dataset of $ n $ observations, the sample analog of $ \mathbb{E}(X' u) = 0 $ is the condition \frac{1}{n} X' (Y - X \hat{\beta}) = 0, where $ X $ and $ Y $ are the $ n \times k $ [design matrix](/page/Design_matrix) and $ n \times 1 $ response vector, respectively, and $ \hat{\beta} $ is the [estimator](/page/Estimator). Solving this equation yields the normal equations $ X' X \hat{\beta} = X' Y $, and assuming $ X' X $ is invertible, the OLS estimator follows as \hat{\beta} = (X' X)^{-1} X' Y. This derivation highlights how OLS emerges directly from moment matching without invoking minimization of a quadratic form.[](https://www.reed.edu/economics/parker/s10/312/notes/Notes2.pdf) Within the broader method of moments framework, originally proposed by Pearson in [1894](/page/1894) and extended to the [generalized method of moments](/page/Generalized_method_of_moments) (GMM) by Hansen in 1982, OLS represents a special case where the moment conditions are linear in the parameters. The general setup involves a vector of moment functions $ g(\theta) = \mathbb{E}[m(Z, \theta)] = 0 $, with the sample analog $ \frac{1}{n} \sum_{i=1}^n m(z_i, \theta) = 0 $ solved for $ \theta $; for OLS, $ m(z_i, \beta) = x_i (y_i - x_i' \beta) $, making it exactly identified with as many moments as parameters. This linear structure simplifies computation and ensures the estimator coincides with the least squares solution. Compared to [maximum likelihood estimation](/page/Maximum_likelihood_estimation), which typically assumes a full [probability distribution](/page/Probability_distribution) for the errors (such as normality), the method of moments approach for OLS requires only the validity of these unconditional moment conditions, rendering it robust to some distributional misspecifications as long as the moments exist and the [orthogonality](/page/Orthogonality) holds.[](https://warwick.ac.uk/fac/soc/economics/staff/swarulampalam/panel/gmm.pdf) ## Assumptions and Violations ### Classical Assumptions The classical assumptions underlying ordinary least squares (OLS) regression, often referred to as the Gauss–Markov assumptions, provide the foundational conditions for the OLS estimator to achieve optimal properties in linear models. These assumptions ensure that the model is appropriately specified and that the errors behave in a manner that supports unbiased and efficient [estimation](/page/Estimation). Formally introduced by [Carl Friedrich Gauss](/page/Carl_Friedrich_Gauss) in his 1821 work *Theoria Combinationis Observationum Erroribus Minimis Obnoxiae*, the theorem was later generalized by [Andrey Markov](/page/Andrey_Markov) in 1912 to encompass broader linear models with correlated observations. Under these assumptions, the [Gauss–Markov theorem](/page/Gauss–Markov_theorem) establishes that the OLS estimator is the best linear unbiased estimator (BLUE), possessing the smallest variance among all linear unbiased estimators of the parameters.[](https://www.statlect.com/fundamentals-of-statistics/Gauss-Markov-theorem) The first assumption is **linearity in parameters**, which posits that the [conditional expectation](/page/Conditional_expectation) of the dependent variable given the regressors is a [linear function](/page/Linear_function) of those regressors. 
This is expressed as $$E(\mathbf{Y} \mid \mathbf{X}) = \mathbf{X} \boldsymbol{\beta},$$ where $\mathbf{Y}$ is the $n \times 1$ vector of observations, $\mathbf{X}$ is the $n \times (k+1)$ design matrix (including an intercept column), and $\boldsymbol{\beta}$ is the $(k+1) \times 1$ vector of parameters. This assumption requires the model to be correctly specified in its [linear form](/page/Linear_form), allowing the systematic component to capture the true relationship without omitted nonlinearities or misspecifications.

The second key assumption is **strict exogeneity**, stating that the error term has zero conditional mean given the regressors, such that $$E(\boldsymbol{\epsilon} \mid \mathbf{X}) = \mathbf{0}.$$ This implies that the regressors are not systematically related to the unobserved factors captured by the errors, ensuring no endogeneity or [omitted variable bias](/page/Omitted-variable_bias) that would lead to inconsistent estimates. Strict exogeneity is crucial for the unbiasedness of the OLS estimator, as violations could introduce correlation between $\mathbf{X}$ and $\boldsymbol{\epsilon}$, biasing parameter estimates.[](https://statisticsbyjim.com/regression/gauss-markov-theorem-ols-blue/)

Homoskedasticity forms the third assumption, requiring that the conditional variance of each error term is constant and equal to $\sigma^2$ across all observations, given the regressors: $$\text{Var}(\epsilon_i \mid \mathbf{X}) = \sigma^2, \quad i = 1, \dots, n.$$ This equal variance condition, often combined with the spherical errors assumption (no autocorrelation, so $\text{Cov}(\epsilon_i, \epsilon_j \mid \mathbf{X}) = 0$ for $i \neq j$), ensures that the errors have a homoskedastic and uncorrelated structure. Homoskedasticity is essential for the efficiency of OLS, as it guarantees the minimum variance property within the class of linear unbiased estimators.[](https://www.statlect.com/fundamentals-of-statistics/Gauss-Markov-theorem)

Finally, the assumption of **no perfect multicollinearity** requires that the design matrix $\mathbf{X}$ has full column rank, specifically $\text{rank}(\mathbf{X}) = k+1$, where $k$ is the number of explanatory variables. This prevents the regressors from being perfectly linearly dependent, which would make the parameters non-identifiable and the OLS estimator undefined. Without perfect multicollinearity, the inverse $(\mathbf{X}^\top \mathbf{X})^{-1}$ exists, allowing computation of the closed-form OLS solution.
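The rank condition can be verified directly before solving the normal equations. The following minimal Python sketch, using NumPy on simulated data with illustrative variable names, checks full column rank and then computes the closed-form solution via a least-squares solver rather than an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated design with an intercept column and two regressors (illustrative data).
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)

# No perfect multicollinearity requires full column rank of X.
assert np.linalg.matrix_rank(X) == X.shape[1]

# With full rank, X'X is invertible and the closed-form OLS solution exists;
# lstsq is preferred to an explicit inverse for numerical stability.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [1.0, 2.0, -0.5]
```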
Collectively, these assumptions underpin the Gauss–Markov theorem's conclusion that OLS yields [BLUE](/page/Blue) estimates, a result that holds without requiring normality of the errors, though normality is often added for [inference](/page/Inference) purposes in finite samples.[](https://statisticsbyjim.com/regression/gauss-markov-theorem-ols-blue/)

### Heteroskedasticity and [Autocorrelation](/page/Autocorrelation)

In ordinary least squares (OLS) regression, heteroskedasticity occurs when the conditional variance of the error term varies across observations, such that $\operatorname{Var}(\varepsilon_i \mid \mathbf{X}_i) = \sigma_i^2$, where $\sigma_i^2$ depends on the regressors $\mathbf{X}_i$ or other factors.[](https://www.ucl.ac.uk/~uctp41a/b203/lecture9.pdf) This violation of the classical assumption of homoskedasticity implies that the errors have unequal spreads, often increasing with the level of the predictors, as seen in [cross-sectional data](/page/Cross-sectional_data) like income regressions where variance rises with income levels.[](https://www.reed.edu/economics/parker/s12/312/notes/Notes8.pdf) Under heteroskedasticity, the OLS estimator $\hat{\beta}$ remains unbiased and consistent, meaning it converges in probability to the true $\beta$ as the sample size grows.[](https://stats.stackexchange.com/questions/378851/why-use-ols-when-it-is-assumed-there-is-heteroscedasticity) However, $\hat{\beta}$ becomes inefficient, exhibiting larger variance than the best linear unbiased [estimator](/page/Estimator), and the conventional standard errors are biased, typically underestimated, which invalidates t-tests, F-tests, and confidence intervals by overstating [statistical significance](/page/Statistical_significance).[](https://spureconomics.com/heteroscedasticity-causes-and-consequences/) To detect heteroskedasticity, the Breusch-Pagan test regresses the squared OLS residuals on the regressors and applies a [Lagrange multiplier](/page/Lagrange_multiplier) statistic that follows a [chi-squared distribution](/page/Chi-squared_distribution) under the null of homoskedasticity.[](https://edu.hansung.ac.kr/~jecon/art/BreuschPagan_1979.pdf) This test, proposed by Breusch and Pagan, is widely used for its simplicity and power against common forms of heteroskedasticity.[](https://www.semanticscholar.org/paper/A-simple-test-for-heteroscedasticity-and-random-vol-Breusch-Pagan/a05a732eaa9462ba7df9195dad17d78218533efd)

Autocorrelation, or serial correlation, arises when the errors are not independent, with $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) \neq 0$ for $i \neq j$, frequently in [time series](/page/Time_series) data due to omitted variables, [inertia](/page/Inertia), or measurement errors.[](https://online.stat.psu.edu/stat501/lesson/t/t.2/t.2.3-testing-and-remedial-measures-autocorrelation) Positive autocorrelation, where high errors follow high errors, is common in economic [time series](/page/Time_series) like GDP growth.[](https://www.investopedia.com/terms/d/durbin-watson-statistic.asp) Like heteroskedasticity, autocorrelation leaves the OLS $\hat{\beta}$ unbiased and consistent under standard conditions, but renders it inefficient with inflated variance; moreover, the usual standard errors are biased, often underestimated in cases of positive autocorrelation, leading to overstated t-statistics and unreliable inference.[](https://analystprep.com/study-notes/cfa-level-2/quantitative-method/explain-serial-correlation-and-how-it-affects-statistical-inference/) Both issues compound
in panel or time series models, where ignoring them can mislead policy analysis by suggesting spurious precision.[](https://www3.nd.edu/~wevans1/econ30331/autocorrelation.pdf) The Durbin-Watson test detects first-order [autocorrelation](/page/Autocorrelation) by computing a statistic $d = \frac{\sum_{t=2}^n (\hat{e}_t - \hat{e}_{t-1})^2}{\sum_{t=1}^n \hat{e}_t^2}$, where $\hat{e}_t$ are OLS residuals, and comparing it to critical bounds; values near 2 indicate no autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.[](https://academic.oup.com/biomet/article-abstract/37/3-4/409/176531) Developed by Durbin and Watson, this test is approximate but effective for models without lagged dependent variables.

Remedies for these violations prioritize correcting [inference](/page/Inference) without altering point estimates. [Heteroskedasticity-consistent standard errors](/page/Heteroskedasticity-consistent_standard_errors), introduced by [White](/page/White), estimate the [covariance matrix](/page/Covariance_matrix) as $\hat{V}(\hat{\beta}) = (X'X)^{-1} \left( \sum_i \hat{e}_i^2 x_i x_i' \right) (X'X)^{-1}$, providing valid t-tests even under unknown forms of heteroskedasticity.[](https://crooker.faculty.unlv.edu/econ441/econ_papers/White-Heteroskedasticity-Correction-1980.pdf) For both heteroskedasticity and [autocorrelation](/page/Autocorrelation), [generalized least squares](/page/Generalized_least_squares) (GLS) transforms the model to $\Sigma^{-1/2} Y = (\Sigma^{-1/2} X) \beta + \tilde{\varepsilon}$ with $\operatorname{Var}(\tilde{\varepsilon}) = I$, where $\Sigma$ is the error covariance matrix, yielding efficient estimates if $\Sigma$ is known; feasible GLS (FGLS) estimates $\Sigma$ from OLS residuals for practical use.[](http://web.vu.lt/mif/a.buteikis/wp-content/uploads/2019/11/MultivariableRegression_4.pdf) These approaches restore efficiency and valid [inference](/page/Inference), with robust errors sufficing for large samples.[](https://www.bauer.uh.edu/rsusmel/phd/ec1-11.pdf)

### Non-Normality and Outliers

In ordinary least squares (OLS) regression, the assumption of normally distributed errors is not required for the estimator to remain unbiased or consistent, provided the other Gauss-Markov conditions hold, such as zero conditional mean and homoskedasticity.[](https://pmc.ncbi.nlm.nih.gov/articles/PMC8613103/) However, non-normality can distort finite-sample inference procedures, including t-tests and F-tests for coefficient significance, as these rely on the exact normality of the error distribution for their validity; in such cases, asymptotic approximations or robust standard errors may be necessary to ensure reliable p-values and confidence intervals.[](https://www.carlislerainey.com/papers/heavy-tails.pdf) To detect non-normality in the residuals, the Jarque-Bera test is commonly applied, which assesses skewness and kurtosis against normal distribution expectations using the statistic $JB = n \left( \frac{S^2}{6} + \frac{(K-3)^2}{24} \right)$, where $n$ is the sample size, $S$ is the sample skewness, and $K$ is the sample kurtosis; under the null hypothesis of normality, this follows a chi-squared distribution with 2 degrees of freedom.[](https://www.jstor.org/stable/1403192)

Outliers in OLS can manifest as points with large residuals, indicating poor model fit for that observation, or as leverage points, where the predictor values are distant from the bulk of the data in the design space, potentially pulling the fitted line toward them despite fitting well.[](https://www.tandfonline.com/doi/abs/10.1080/00401706.1977.10489493)
Leverage is quantified by the diagonal elements $h_{ii}$ of the hat matrix $H = X(X^T X)^{-1} X^T$, with values exceeding $\frac{2(k+1)}{n}$ (where $k$ is the number of predictors and $n$ the sample size) signaling high influence from the predictors alone.[](https://www.tandfonline.com/doi/abs/10.1080/00401706.1977.10489493) Cook's distance provides a unified measure of an observation's joint influence from both residual and leverage, defined as $$D_i = \frac{e_i^2}{(k+1) s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2},$$ where $e_i$ is the $i$-th residual, $s^2$ is the mean squared error, and $h_{ii}$ is the leverage; values of $D_i > \frac{4}{n-k-1}$ typically indicate substantial influence, as they reflect how much the predicted values change if the $i$-th observation is removed.[](https://www.tandfonline.com/doi/abs/10.1080/00401706.1977.10489493) Influential observations are those whose deletion causes a notable shift in the OLS coefficient estimates $\hat{\beta}$, often quantified by the difference $\hat{\beta}_{(i)} - \hat{\beta}$, where $\hat{\beta}_{(i)}$ excludes the $i$-th point; such points can bias the overall fit if they disproportionately affect the parameter vector.[](https://www.tandfonline.com/doi/abs/10.1080/00401706.1977.10489493) For instance, a single influential point might inflate or deflate specific coefficients, leading to misleading interpretations of relationships.[](https://www.tandfonline.com/doi/abs/10.1080/00401706.1977.10489493)

To mitigate the effects of outliers and non-normality, robust regression methods downweight extreme residuals using M-estimators, such as the [Huber](/page/Huber) estimator, which minimizes a [loss function](/page/Loss_function) that is quadratic for small errors but linear for large ones, thereby reducing sensitivity to contamination while preserving efficiency under approximate normality.[](https://projecteuclid.org/journals/annals-of-statistics/volume-1/issue-5/Robust-Regression-Asymptotics-Conjectures-and-Monte-Carlo/10.1214/aos/1176342503.full) Alternatively, trimming involves excluding suspected outliers based on diagnostic measures before applying OLS, though this risks information loss if the points are not truly aberrant.[](https://projecteuclid.org/journals/annals-of-statistics/volume-1/issue-5/Robust-Regression-Asymptotics-Conjectures-and-Monte-Carlo/10.1214/aos/1176342503.full) Numerical issues, such as rounding errors in computed [data](/page/Data), can also behave like outliers by amplifying discrepancies in ill-conditioned designs, where small perturbations in one [observation](/page/Observation) propagate to distort the entire coefficient vector.[](https://www.jstor.org/stable/2348515)

## Finite-Sample Properties

### Unbiasedness and Efficiency

Under the assumptions of [linearity](/page/Linearity) in parameters, strict exogeneity ($E[\varepsilon \mid X] = 0$), and no perfect [multicollinearity](/page/Multicollinearity), the ordinary least squares (OLS) estimator $\hat{\beta}$ is unbiased in finite samples, meaning its [expected value](/page/Expected_value) equals the true [parameter](/page/Parameter) vector: $E[\hat{\beta}] = \beta$.[](https://www.stat.berkeley.edu/~census/GaussMr2.pdf) This unbiasedness follows from the [linearity](/page/Linearity) of the OLS [estimator](/page/Estimator) in the observed [data](/page/Data). The [closed-form expression](/page/Closed-form_expression) for $\hat{\beta}$ takes the [linear form](/page/Linear_form) $\hat{\beta} = (X'X)^{-1} X' y$, where $y = X\beta + \varepsilon$.
Substituting yields $\hat{\beta} = \beta + (X'X)^{-1} X' \varepsilon$. Taking expectations conditional on $X$ gives $E[\hat{\beta} \mid X] = \beta + (X'X)^{-1} X' E[\varepsilon \mid X] = \beta$, since $E[\varepsilon \mid X] = 0$ under the exogeneity assumption; unconditional unbiasedness $E[\hat{\beta}] = \beta$ then follows by the law of iterated expectations.[](http://www.unm.edu/~jikaczmarski/working_papers/gm_proof.pdf) Under the additional Gauss-Markov assumption of homoskedasticity ($\operatorname{Var}(\varepsilon \mid X) = \sigma^2 I$), the OLS [estimator](/page/Estimator) $\hat{\beta}$ is the best linear unbiased [estimator](/page/Estimator) (BLUE), possessing the minimum variance among all linear unbiased estimators of $\beta$.[](https://www2.stat.duke.edu/courses/Spring23/sta211.01/slides/lec-8.pdf) Specifically, the [covariance matrix](/page/Covariance_matrix) of $\hat{\beta}$ is $\operatorname{Var}(\hat{\beta}) = \sigma^2 (X'X)^{-1}$, which is the smallest possible in the positive semi-definite sense for any competing linear unbiased estimator.[](http://www.stat.ucla.edu/~nchristo/statistics100C/gauss_markov_theorem.pdf)

The proof of the [BLUE](/page/Blue) property uses the [decomposition](/page/Decomposition) for any other linear unbiased [estimator](/page/Estimator) $\hat{\gamma} = C y$, where $E[\hat{\gamma}] = \beta$ implies $C X = I$. Let $B = C - (X'X)^{-1} X'$, so $B X = 0$. Then $\operatorname{Var}(\hat{\gamma}) = \sigma^2 C C' = \sigma^2 [((X'X)^{-1} X' + B) ((X'X)^{-1} X' + B)'] = \sigma^2 [(X'X)^{-1} + B B']$, since the cross terms vanish ($X' B' = (B X)' = 0$). Thus, $\operatorname{Var}(\hat{\gamma}) - \operatorname{Var}(\hat{\beta}) = \sigma^2 B B'$, which is positive semi-definite. This efficiency holds without requiring normality of the errors, relying solely on the classical assumptions for [linearity](/page/Linearity) and unbiasedness.[](https://www.stat.berkeley.edu/~census/GaussMr2.pdf)[](https://web.ics.purdue.edu/~jltobias/671/lecture_notes/regression4.pdf)

### Variance of Estimators

Under the Gauss-Markov assumptions of the classical linear regression model (errors uncorrelated, with constant variance $\sigma^2$ and zero mean), the ordinary least squares (OLS) estimator $\hat{\beta}$ has a finite-sample covariance matrix that can be derived directly from its expression as a linear function of the errors. Specifically, the model is $Y = X\beta + \epsilon$, with $\mathbb{E}(\epsilon) = 0$ and $\operatorname{Var}(\epsilon) = \sigma^2 I_n$. Substituting yields $\hat{\beta} = (X^\top X)^{-1} X^\top Y = \beta + (X^\top X)^{-1} X^\top \epsilon$. The covariance matrix then follows as $\operatorname{Var}(\hat{\beta}) = (X^\top X)^{-1} X^\top (\sigma^2 I_n) X (X^\top X)^{-1} = \sigma^2 (X^\top X)^{-1}$, assuming $X$ is non-stochastic or conditioned upon.[](https://www.stat.berkeley.edu/~census/general.pdf)

Since $\sigma^2$ is unknown in practice, it is estimated using the residuals from the fitted model. The residual vector is $e = Y - X\hat{\beta} = (I_n - X(X^\top X)^{-1}X^\top) \epsilon = (I_n - P_X) \epsilon$, where $P_X$ is the [projection matrix](/page/Projection_matrix) onto the column space of $X$. The sum of squared residuals is $e^\top e = \|Y - X\hat{\beta}\|^2$, and the unbiased [estimator](/page/Estimator) of $\sigma^2$ is $s^2 = \frac{\|Y - X\hat{\beta}\|^2}{n - k - 1}$, where $n$ is the sample size and $k$ is the number of regressors (excluding the intercept). This estimator is unbiased under the Gauss-Markov assumptions.
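These quantities are straightforward to compute directly. The following minimal NumPy sketch, with simulated data and illustrative names, forms $\hat{\beta}$, the residuals, the unbiased variance estimate $s^2$, and the standard errors obtained from $s^2 (X'X)^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 2                     # k regressors plus an intercept column
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.7, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y      # closed-form OLS estimate

e = y - X @ beta_hat              # residual vector
s2 = e @ e / (n - k - 1)          # unbiased estimate of sigma^2
cov_beta = s2 * XtX_inv           # estimated Var(beta_hat) = s^2 (X'X)^{-1}
se = np.sqrt(np.diag(cov_beta))   # standard errors of the coefficients

print(beta_hat, se)
```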
Under the additional assumption of normally distributed errors, $(n - k - 1)s^2 / \sigma^2 \sim \chi^2_{n-k-1}$.[](https://users.stat.umn.edu/~helwig/notes/mvlr-Notes.pdf) The estimated covariance matrix of $\hat{\beta}$ is then $s^2 (X^\top X)^{-1}$, and the standard errors of the individual coefficients are the square roots of its diagonal elements, i.e., $\operatorname{se}(\hat{\beta}_j) = \sqrt{s^2 \cdot [(X^\top X)^{-1}]_{jj}}$ for the $j$-th coefficient. These standard errors quantify the sampling variability of each $\hat{\beta}_j$ and form the basis for inference in finite samples.[](https://www.stat.berkeley.edu/~census/general.pdf)

In partitioned regression, where the regressors are split as $X = [X_1 \, X_2]$ with corresponding coefficients $\beta = [\beta_1^\top \, \beta_2^\top]^\top$, the Frisch-Waugh-Lovell theorem provides a focused expression for the variance of the subset estimator $\hat{\beta}_1$. The theorem states that $\hat{\beta}_1$ equals the OLS coefficient from regressing the residuals of $Y$ on $X_2$ (denoted $y^{(2)}$) onto the residuals of $X_1$ on $X_2$ (denoted $X_1^{(2)}$), yielding $\hat{\beta}_1 = (X_1^{(2)\top} X_1^{(2)})^{-1} X_1^{(2)\top} y^{(2)}$. The corresponding variance is $\operatorname{Var}(\hat{\beta}_1) = \sigma^2 (X_1^{(2)\top} X_1^{(2)})^{-1}$, which isolates the variability attributable to the variables in $X_1$ after accounting for those in $X_2$. This result, originally demonstrated for partial regressions, facilitates computational efficiency and interpretation in models with grouped regressors.[](https://web.stanford.edu/~mrosenfe/soc_meth_proj3/matrix_OLS_NYU_notes.pdf)[](http://repec.wesleyan.edu/pdf/mlovell/2005012_lovell.pdf)

### Influential Data Points

In ordinary least squares (OLS) [estimation](/page/Estimation), certain data points can disproportionately affect the [parameter](/page/Parameter) estimates in finite samples due to their position in the design space or the near-linear dependencies among predictors. These influential observations arise because the OLS [estimator](/page/Estimator) weights points based on their leverage, potentially leading to biased or unstable inferences if not identified. The impact is particularly pronounced in small samples, where the inverse moment matrix $(X'X)^{-1}$ amplifies the contribution of atypical points.

Leverage quantifies the potential influence of the $i$-th observation on the fitted values solely through its predictor values, independent of the response. It is given by the diagonal element of the hat matrix $H = X(X'X)^{-1}X'$, specifically $h_{ii} = x_i'(X'X)^{-1}x_i$, where $x_i$ is the $i$-th row of the design matrix $X$. When the model includes an intercept, values of $h_{ii}$ range from $1/n$ to 1, with an average of $p/n$ across all observations, where $n$ is the sample size and $p$ is the number of parameters; points with $h_{ii} > 2p/n$ are typically flagged as high-leverage.
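A short NumPy sketch (simulated data, illustrative names) computes the hat-matrix diagonal and applies the conventional $2p/n$ cutoff:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = np.append(rng.uniform(0, 5, size=n - 1), 20.0)   # one point far from the rest
X = np.column_stack([np.ones(n), x])                  # intercept plus one regressor
p = X.shape[1]

# Leverages are the diagonal of the hat matrix H = X (X'X)^{-1} X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

flagged = np.where(h > 2 * p / n)[0]   # conventional 2p/n cutoff
print(h.round(3), flagged)             # the isolated point has by far the largest leverage
```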
High-leverage points, often located far from the centroid of the predictors, can dominate the estimation even if their residuals are small, as they pull the fit toward their position in the $X$-space.[](https://dspace.mit.edu/bitstream/handle/1721.1/48325/linearregression00wels.pdf?sequence=1&isAllowed=y) To assess the actual change in individual [coefficient](/page/Coefficient) estimates, the DFBETAS [statistic](/page/Statistic) measures the standardized difference in the $j$-th [parameter](/page/Parameter) when the $i$-th [observation](/page/Observation) is deleted: $$\text{DFBETAS}_{j(i)} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{s_{(i)} \sqrt{c_{jj}}},$$ where $\hat{\beta}_{j(i)}$ is the estimate without the $i$-th point, $s_{(i)}$ is the residual standard error from the deleted fit, and $c_{jj}$ is the $j$-th diagonal element of $(X'X)^{-1}$. A point is considered influential if $|\text{DFBETAS}_{j(i)}| > 2/\sqrt{n}$, indicating a notable shift in $\hat{\beta}_j$. This metric combines leverage with the residual discrepancy, highlighting observations that alter specific parameters.[](https://dspace.mit.edu/bitstream/handle/1721.1/48325/linearregression00wels.pdf?sequence=1&isAllowed=y)

Multicollinearity exacerbates the influence of data points by inflating the variance of the estimators through large elements in $(X'X)^{-1}$. When predictors are highly correlated, the inverse matrix develops elevated diagonal and off-diagonal entries, particularly in directions of near-linear dependence, making coefficients sensitive to individual observations aligned with those dependencies. This variance inflation can render even moderate-leverage points highly influential, as small changes in $X$ or $y$ propagate disproportionately to $\hat{\beta}$. Diagnostics for collinearity, such as condition indices derived from the [singular value decomposition](/page/Singular_value_decomposition) of $X$, help identify these unstable configurations.

In small samples, a single point can dominate the OLS fit; for instance, consider a simple linear regression with $n=5$ and one predictor, where four points lie close to the line $y = x$ (e.g., $x = 1,2,3,4$; $y \approx x$) and the fifth lies far out ($x=10$, $y=15$). The leverage of the outlying point is about 0.92, while the others are at most about 0.38, so $(X'X)^{-1}$ amplifies its weight: including it pulls the estimated slope from roughly 1.0 (the trend of the cluster alone) up to about 1.6. Deleting this single point thus shifts the slope estimate by roughly 60 percent, illustrating how finite-sample scarcity allows one observation to control the estimates.

## Asymptotic Properties

### Consistency

In the context of ordinary least squares (OLS) estimation, consistency refers to the property that the estimator $\hat{\beta}$ converges in probability to the true parameter vector $\beta$ as the sample size $n$ approaches [infinity](/page/Infinity), denoted as $\operatorname{plim}_{n \to \infty} \hat{\beta} = \beta$. This large-sample convergence holds under relatively weak conditions, including a fixed number of parameters $k$, strict exogeneity $\mathbb{E}[u \mid X] = 0$, and a rank condition ensuring that the [design matrix](/page/Design_matrix) $X$ has full column rank asymptotically. Additionally, the data-generating process must satisfy [ergodicity](/page/Ergodicity) or independence and identical distribution (i.i.d.)
assumptions to invoke the [law of large numbers](/page/Law_of_large_numbers) (LLN), along with finite second moments for the regressors and errors to ensure the necessary probability limits exist.[](https://cameron.econ.ucdavis.edu/e240a/asymptotic.pdf)

The proof of consistency relies on rewriting the OLS estimator in probability limit form and applying Slutsky's theorem. Specifically, express $\hat{\beta} = \left( \frac{X'X}{n} \right)^{-1} \frac{X'y}{n}$, where $y = X\beta + u$. Under the stated assumptions, the LLN implies $\operatorname{plim}_{n \to \infty} \frac{X'X}{n} = Q$, a positive definite matrix representing the second moment of the regressors, and $\operatorname{plim}_{n \to \infty} \frac{X'y}{n} = Q\beta$, due to exogeneity ensuring $\mathbb{E}[Xu] = 0$. By continuous mapping and [Slutsky's theorem](/page/Slutsky's_theorem), $\operatorname{plim}_{n \to \infty} \hat{\beta} = Q^{-1} (Q\beta) = \beta$. Notably, this result does not require normality of the errors or homoskedasticity, as those assumptions are pertinent to finite-sample efficiency or asymptotic normality rather than point convergence.[](https://cameron.econ.ucdavis.edu/e240a/asymptotic.pdf)[](https://bstewart.scholar.princeton.edu/document/205)[](https://www.bauer.uh.edu/rsusmel/phd/ec1-7.pdf)

Under slightly stronger moment conditions, such as finite fourth moments for the errors and regressors, the OLS estimator is $\sqrt{n}$-consistent, meaning $\sqrt{n} (\hat{\beta} - \beta)$ remains stochastically bounded and converges in distribution to a normal limit (though the distributional aspect is addressed elsewhere). This rate underscores the practical utility of OLS in large samples, where estimation error diminishes proportionally to $1/\sqrt{n}$.[](https://cameron.econ.ucdavis.edu/e240a/asymptotic.pdf)

### Asymptotic Normality

Under suitable regularity conditions, the ordinary least squares (OLS) estimator $\hat{\beta}$ exhibits asymptotic normality, meaning that as the sample size $n$ approaches infinity, the scaled difference $\sqrt{n}(\hat{\beta} - \beta)$ converges in distribution to a [multivariate normal distribution](/page/Multivariate_normal_distribution). This property, which builds on the consistency of $\hat{\beta}$, is fundamental for large-sample inference in [linear regression](/page/Linear_regression) models.[](https://doi.org/10.1016/S1573-4412(05)80005-4) The asymptotic normality arises from applying the [central limit theorem](/page/Central_limit_theorem) (CLT) to the term $\frac{1}{\sqrt{n}} X' \epsilon$, where $X$ is the [design matrix](/page/Design_matrix) and $\epsilon$ is the error vector. Assuming independent and identically distributed (i.i.d.) errors with mean zero and finite variance $\sigma^2$, and that the regressors satisfy $\operatorname{plim}_{n \to \infty} \frac{1}{n} X'X = Q$ where $Q$ is positive definite, the CLT implies $$\frac{1}{\sqrt{n}} X' \epsilon \xrightarrow{d} N(0, \sigma^2 Q).$$
Combined with the consistency of $\hat{\beta}$ and the continuous mapping and [Slutsky](/page/Slutsky's_theorem) theorems, this yields the key distributional result $$\sqrt{n} (\hat{\beta} - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{-1}),$$ where $\sigma^2 Q^{-1}$ is the asymptotic [covariance matrix](/page/Covariance_matrix).[](https://doi.org/10.1016/S1573-4412(05)80005-4)[](https://cameron.econ.ucdavis.edu/e240a/asymptotic.pdf)

In the presence of heteroskedasticity, where the error variances $\operatorname{Var}(\epsilon_i \mid X_i) = \sigma_i^2$ may differ across observations but remain finite, the homoskedastic form no longer holds. Instead, the asymptotic covariance takes the "sandwich" form $\Omega = Q^{-1} \left( \operatorname{plim}_{n \to \infty} \frac{1}{n} X' \operatorname{diag}(\epsilon_1^2, \dots, \epsilon_n^2) X \right) Q^{-1}$, ensuring valid inference without assuming constant variance. This robust estimator, which adjusts for heteroskedasticity-consistent standard errors, was formalized in the seminal work on covariance matrix estimation.[](https://www.jstor.org/stable/1912934)

### Inference Procedures

In large samples, inference procedures for the ordinary least squares (OLS) estimator $\hat{\beta}$ rely on its asymptotic normality to construct tests and [confidence](/page/Confidence) intervals for hypotheses about the true parameters $\beta$. These methods approximate the finite-sample distributions with normal or chi-squared distributions, providing robust inference even when classical assumptions like homoskedasticity or normality of errors are violated, as long as consistency and the [central limit theorem](/page/Central_limit_theorem) hold.[](https://eml.berkeley.edu/~powell/e240b_sp10/alsnotes.pdf)[](https://www.schmidheiny.name/teaching/ols.pdf)

The Wald test is a general framework for testing linear or nonlinear restrictions on $\beta$. For a hypothesis $H_0: g(\beta) = 0$, where $g$ is a function of dimension $r$, the test statistic is $W_N = N \cdot g(\hat{\beta})' \left[ \hat{G} \cdot \widehat{\text{Avar}}(\hat{\beta}) \cdot \hat{G}' \right]^{-1} g(\hat{\beta})$, which converges in distribution to $\chi^2_r$ under the null, with $\hat{G}$ as the Jacobian of $g$ evaluated at $\hat{\beta}$ and $\widehat{\text{Avar}}(\hat{\beta})$ as a consistent estimator of the asymptotic variance, often using the robust sandwich form $\hat{D}^{-1} \hat{C} \hat{D}^{-1}$, where $\hat{D} = N^{-1} \sum x_i x_i'$ and $\hat{C} = N^{-1} \sum \hat{\epsilon}_i^2 x_i x_i'$. For linear restrictions $H_0: R\beta = q$ with $R$ of dimension $J \times K$, this simplifies to $W = (R\hat{\beta} - q)' \left[ R \widehat{\text{Avar}}(\hat{\beta}) R' \right]^{-1} (R\hat{\beta} - q) \stackrel{a}{\sim} \chi^2_J$. The null is rejected if $W$ exceeds the critical value from the chi-squared distribution at the desired significance level.[](https://eml.berkeley.edu/~powell/e240b_sp10/alsnotes.pdf)[](https://www.bauer.uh.edu/rsusmel/phd/ec1-7.pdf)[](https://www.schmidheiny.name/teaching/ols.pdf)

Asymptotic t-tests, often called z-tests in large samples, assess individual coefficients or simple linear combinations. For $H_0: \beta_j = \beta_{j0}$, the statistic is $z_j = \frac{\hat{\beta}_j - \beta_{j0}}{\text{se}(\hat{\beta}_j)}$, where $\text{se}(\hat{\beta}_j)$ is the square root of the $j$-th diagonal element of $\widehat{\text{Avar}}(\hat{\beta})$; under the null, $z_j \stackrel{a}{\sim} N(0,1)$.
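These large-sample procedures are available in standard software. The sketch below, assuming Python's statsmodels and its HC1 robust covariance option, illustrates robust z-statistics and a joint Wald test on simulated heteroskedastic data (all names and values illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
# Heteroskedastic errors: the error spread grows with |x1|.
y = 1.0 + 2.0 * x1 + 0.0 * x2 + rng.normal(scale=0.5 + np.abs(x1), size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit(cov_type="HC1")   # heteroskedasticity-robust (sandwich) covariance

print(res.params)      # beta_hat
print(res.bse)         # robust standard errors
print(res.tvalues)     # asymptotic z-statistics based on the robust covariance

# Joint Wald test of H0: beta_1 = beta_2 = 0 (asymptotically chi-squared with 2 df).
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
print(res.wald_test(R, use_f=False))
```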
The test rejects if $|z_j|$ exceeds the critical value from the standard normal distribution, such as 1.96 for a 5% two-sided test. This procedure extends to any linear hypothesis $H_0: r' \beta = q$ via $z = \frac{r' \hat{\beta} - q}{\sqrt{r' \widehat{\text{Avar}}(\hat{\beta}) r}}$.[](https://eml.berkeley.edu/~powell/e240b_sp10/alsnotes.pdf)[](https://www.schmidheiny.name/teaching/ols.pdf)[](https://www.bauer.uh.edu/rsusmel/phd/ec1-7.pdf)

Confidence intervals for $\beta$ or linear combinations follow directly from the asymptotic normality. A $(1 - \alpha) \times 100\%$ interval for $\beta_j$ is $\hat{\beta}_j \pm z_{\alpha/2} \cdot \text{se}(\hat{\beta}_j)$, where $z_{\alpha/2}$ is the $(1 - \alpha/2)$-quantile of the standard normal distribution; for example, $\pm 1.96 \cdot \text{se}(\hat{\beta}_j)$ at the 95% level. For the vector $\beta$, elementwise intervals are formed in the same way, using the square roots of the diagonal elements of $\widehat{\text{Avar}}(\hat{\beta})$ as the standard errors. These intervals capture each true $\beta_j$ with probability approaching $1 - \alpha$ as the sample size grows.[](https://www.schmidheiny.name/teaching/ols.pdf)[](https://eml.berkeley.edu/~powell/e240b_sp10/alsnotes.pdf)

The F-test for subsets of coefficients, such as testing joint significance of a group of regressors, is asymptotically equivalent to the Wald test under large $n$. For $H_0: R\beta = 0$ where $R$ selects $J$ coefficients, the statistic $F = \frac{1}{J} (R\hat{\beta})' \left[ R \widehat{\text{Avar}}(\hat{\beta}) R' \right]^{-1} (R\hat{\beta})$ is the Wald statistic divided by $J$ and is asymptotically distributed as $\chi^2_J / J$; multiplying by $J$ recovers the chi-squared (Wald) form directly. This provides a test for restricted models, rejecting the null if the statistic exceeds the corresponding critical value, and is particularly useful for comparing nested models in large samples.[](https://www.bauer.uh.edu/rsusmel/phd/ec1-7.pdf)[](https://www.schmidheiny.name/teaching/ols.pdf)

## Prediction and Diagnostics

### Fitted Values and Residuals

In ordinary least squares (OLS) regression, the fitted values represent the predicted response values based on the estimated model parameters.
For a dataset with response vector $\mathbf{y}$ (of length $n$) and design matrix $\mathbf{X}$ (of dimension $n \times (k+1)$, including an intercept column), the fitted values are denoted $\hat{\mathbf{y}}$ and computed as $\hat{\mathbf{y}} = \mathbf{X} \hat{\boldsymbol{\beta}}$, where $\hat{\boldsymbol{\beta}}$ is the OLS coefficient vector.[](https://web.stanford.edu/~mrosenfe/soc_meth_proj3/matrix_OLS_NYU_notes.pdf) Equivalently, in matrix notation, $\hat{\mathbf{y}} = \mathbf{H} \mathbf{y}$, where $\mathbf{H} = \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top$ is known as the [hat matrix](/page/Hat).[](https://web.stanford.edu/~mrosenfe/soc_meth_proj3/matrix_OLS_NYU_notes.pdf) The [hat matrix](/page/Hat) $\mathbf{H}$ is symmetric ($\mathbf{H}^\top = \mathbf{H}$) and idempotent ($\mathbf{H}^2 = \mathbf{H}$), properties that reflect its role as an orthogonal projection onto the column space of $\mathbf{X}$.[](http://users.stat.umn.edu/~helwig/notes/mlr-Notes.pdf) Additionally, the trace of $\mathbf{H}$, denoted $\operatorname{tr}(\mathbf{H})$, equals the number of parameters in the model, $k+1$, which quantifies the effective dimensionality of the projection.[](http://www.mysmu.edu/faculty/anthonytay/MFE/OLS_using_Matrix_Algebra.pdf)

The residuals, denoted $\mathbf{e}$, measure the discrepancies between the observed responses and the fitted values, defined as $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H}) \mathbf{y}$, where $\mathbf{I}$ is the $n \times n$ [identity matrix](/page/Identity_matrix).[](https://www.math.wm.edu/~leemis/otext3.pdf) By construction of the OLS estimator, the residuals satisfy two key orthogonality conditions: they are orthogonal to each column of $\mathbf{X}$ ($\mathbf{X}^\top \mathbf{e} = \mathbf{0}$), and, because the intercept column of [ones](/page/The_Ones) is included in $\mathbf{X}$, their sum is zero ($\sum_{i=1}^n e_i = \mathbf{1}^\top \mathbf{e} = 0$, where $\mathbf{1}$ is a vector of ones). These properties ensure that the residuals capture the unexplained variation after accounting for the linear effects in $\mathbf{X}$, and they hold under the standard OLS assumption of full column rank in $\mathbf{X}$. A primary summary statistic derived from the fitted values and residuals is the [coefficient of determination](/page/Coefficient_of_determination), $R^2$, which quantifies the proportion of total variation in $\mathbf{y}$ explained by the model.
It is given by $$R^2 = 1 - \frac{\| \mathbf{e} \|^2}{\| \mathbf{y} - \bar{y} \mathbf{1} \|^2} = 1 - \frac{\sum_{i=1}^n e_i^2}{\sum_{i=1}^n (y_i - \bar{y})^2},$$ where $\| \cdot \|^2$ denotes the squared Euclidean norm, $\bar{y}$ is the [sample mean](/page/Mean) of $\mathbf{y}$, and the denominator is the [total sum of squares](/page/Total_sum_of_squares) (SST).[](https://users.wfu.edu/cottrell/ecn215/regress_print.pdf) Equivalently, $R^2 = \frac{\| \hat{\mathbf{y}} - \bar{y} \mathbf{1} \|^2}{\| \mathbf{y} - \bar{y} \mathbf{1} \|^2}$, the ratio of the [explained sum of squares](/page/Explained_sum_of_squares) (SSR) to SST.[](http://users.stat.umn.edu/~helwig/notes/slr-Notes.pdf) To account for model complexity and sample size, the adjusted $R^2$ penalizes the inclusion of additional parameters: $$R^2_{\text{adj}} = 1 - \frac{\| \mathbf{e} \|^2 / (n - k - 1)}{\| \mathbf{y} - \bar{y} \mathbf{1} \|^2 / (n - 1)},$$ which replaces the raw sums of squares with their degrees-of-freedom-adjusted counterparts, the residual mean square and the sample variance of $\mathbf{y}$.[](https://stats.oarc.ucla.edu/spss/output/regression-analysis/) For a model with an intercept, $R^2$ ranges from 0 to 1, with higher values indicating better fit; $R^2_{\text{adj}}$ can be slightly negative for very poor fits and is preferred for model comparison across different $k$.[](https://users.wfu.edu/cottrell/ecn215/regress_print.pdf)

### Confidence Intervals for Predictions

In ordinary least squares (OLS) regression, confidence intervals quantify the uncertainty around the estimated [mean](/page/Mean) response at a given predictor vector $x_0$, while prediction intervals account for the additional variability in a new individual observation at the same $x_0$. The point estimate $\hat{Y}_0 = x_0^T \hat{\beta}$ centers both types of intervals, where $\hat{\beta}$ is the OLS coefficient vector.[](https://www.stat.purdue.edu/~fmliang/STAT512/lect3.pdf) The $(1 - \alpha) \times 100\%$ [confidence interval](/page/Confidence_interval) for the mean response $E(y \mid x_0)$ is constructed as $$\hat{Y}_0 \pm t_{n-k-1,\,1-\alpha/2} \, s \, \sqrt{x_0^T (X^T X)^{-1} x_0},$$ where $t_{n-k-1,\,1-\alpha/2}$ is the critical value from the [Student's t-distribution](/page/Student's_t-distribution) with $n - k - 1$ [degrees of freedom](/page/Degrees_of_freedom) ($n$ is the sample size and $k$ is the number of predictors), $s = \sqrt{\text{MSE}}$ is the residual [standard error](/page/Standard_error) with $\text{MSE} = \text{SSE}/(n - k - 1)$, and $X$ is the [design matrix](/page/Design_matrix) including an intercept column. This interval relies on the assumptions of [linearity](/page/Linearity), [independence](/page/Independence), homoscedasticity, and normality of errors in the OLS model.[](https://www.stat.purdue.edu/~fmliang/STAT512/lect3.pdf) For predicting a new response $y_0$ at $x_0$, the corresponding $(1 - \alpha) \times 100\%$ [prediction interval](/page/Prediction_interval) is $$\hat{Y}_0 \pm t_{n-k-1,\,1-\alpha/2} \, s \, \sqrt{1 + x_0^T (X^T X)^{-1} x_0}.$$
The added 1 under the [square root](/page/Square_root) incorporates the variance of the new [error](/page/Error) term, $\sigma^2$, estimated by $s^2$, which explains why prediction intervals are always wider than [confidence](/page/Confidence) intervals for the mean response.[](https://www.stat.purdue.edu/~fmliang/STAT512/lect3.pdf) For large sample sizes, the finite-sample t-based intervals approximate asymptotic normal intervals by replacing $t_{n-k-1,\,1-\alpha/2}$ with the standard normal critical value $z_{1-\alpha/2}$ (approximately 1.96 for $\alpha = 0.05$) and using the [consistent estimator](/page/Consistent_estimator) $s$ for $\sigma$. The asymptotic variance of $\sqrt{n}(\hat{Y}_0 - x_0^T \beta)$ is $\sigma^2 x_0^T (X^T X / n)^{-1} x_0$, leveraging the asymptotic normality of the OLS estimator under standard regularity conditions.[](https://cameron.econ.ucdavis.edu/e240a/asymptotic.pdf)

### Model Diagnostics

Model diagnostics in ordinary least squares (OLS) regression are post-estimation procedures used to validate key assumptions, including [linearity](/page/Linearity), homoskedasticity, normality of errors, absence of [multicollinearity](/page/Multicollinearity) among predictors, and lack of [autocorrelation](/page/Autocorrelation) in residuals. These diagnostics rely primarily on the residuals, defined as the differences between observed and predicted values, to identify potential model misspecifications that could [bias](/page/Bias) estimates or invalidate [inference](/page/Inference).[](https://online.stat.psu.edu/stat462/node/116/)

Residual plots provide a visual assessment of several assumptions. The residuals versus fitted values plot evaluates [linearity](/page/Linearity) and homoskedasticity; under the assumptions, residuals should exhibit a random scatter around the horizontal line at zero, with no discernible patterns such as curves (indicating nonlinearity) or funnel shapes (indicating heteroskedasticity).[](https://library.virginia.edu/data/articles/diagnostic-plots) Deviations in this plot suggest the need for model adjustments, like [polynomial](/page/Polynomial) terms or variance-stabilizing transformations.[](https://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm) Similarly, the normal Q-Q (quantile-quantile) plot checks the normality assumption by comparing ordered residuals to theoretical quantiles of the [normal](/page/The_Normal) distribution; residuals aligning closely with the reference line support normality, while systematic deviations, such as heavy tails, indicate non-normality.[](https://library.virginia.edu/data/articles/diagnostic-plots)

The Ramsey RESET (regression equation specification error test) addresses functional form misspecification by testing whether higher-order terms of the fitted values significantly improve the model. Introduced by Ramsey in [1969](/page/1969), the test involves augmenting the original OLS model with powers (typically squares and cubes) of the fitted values and performing an [F-test](/page/F-test) on the coefficients of these added terms; rejection of the null hypothesis (all added coefficients zero) signals omitted variables or incorrect functional form.[](https://www.jstor.org/stable/2984219)

Multicollinearity among predictors is quantified using variance inflation factors (VIF), which measure how much the variance of an OLS [coefficient](/page/Coefficient) is inflated due to correlations with other predictors.
For the $j$-th predictor, the VIF is calculated as $$\text{VIF}_j = \frac{1}{1 - R_j^2},$$ where $R_j^2$ is the [coefficient of determination](/page/Coefficient_of_determination) from an auxiliary OLS regression of $X_j$ on all other predictors; values exceeding 5 or 10 typically indicate high [multicollinearity](/page/Multicollinearity), potentially leading to unstable estimates, though the choice of threshold depends on context.[](https://www.tandfonline.com/doi/abs/10.1080/00401706.1970.10488699)[](https://online.stat.psu.edu/stat462/node/180/)

Autocorrelation in residuals, common in time-series data, is tested using the Durbin-Watson statistic, which examines [first-order](/page/First-order) serial correlation. Developed by Durbin and Watson in 1950, the test statistic is $$d = \frac{\sum_{t=2}^n (e_t - e_{t-1})^2}{\sum_{t=1}^n e_t^2},$$ where $e_t$ are the residuals; $d$ ranges from 0 to 4, with values near 2 supporting the [null hypothesis](/page/Null_hypothesis) of no [autocorrelation](/page/Autocorrelation), while low ($d < 1.5$) or high ($d > 2.5$) values suggest positive or negative [autocorrelation](/page/Autocorrelation), respectively, often requiring adjustments like including lagged variables. Critical values for significance depend on sample size and number of regressors, available in Durbin-Watson tables.[](https://www.jstor.org/stable/2332391)

## Applications and Examples

### Simple Linear Regression Case

In [simple linear regression](/page/Simple_linear_regression), ordinary least squares (OLS) estimates the linear relationship between a dependent variable $Y$ and a single independent variable $X$ by minimizing the sum of squared residuals. The model is expressed as $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$, where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\epsilon_i$ are the errors. The OLS estimators for these parameters are given by $$\hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X},$$ where $\bar{X}$ and $\bar{Y}$ are the sample means of $X$ and $Y$, respectively.[](https://www.amherst.edu/system/files/media/1287/SLR_Leastsquares.pdf) These estimators provide the best linear unbiased estimates under the Gauss-Markov assumptions.

To assess the significance of the slope $\beta_1$, a t-test is conducted under the null hypothesis $H_0: \beta_1 = 0$. The test statistic is $$t = \frac{\hat{\beta}_1}{\text{se}(\hat{\beta}_1)},$$ where the standard error is $$\text{se}(\hat{\beta}_1) = \frac{s}{\sqrt{\sum_{i=1}^n (X_i - \bar{X})^2}},$$ and $s = \sqrt{\frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}{n-2}}$ is the residual standard error.[](https://www.stat.cmu.edu/~hseltman/309/Book/chapter9.pdf) The [t-statistic](/page/T-statistic) follows a t-distribution with $n-2$ [degrees of freedom](/page/Degrees_of_freedom) under the [null hypothesis](/page/Null_hypothesis), allowing for [p-value](/page/P-value) computation to determine if the predictor significantly explains variation in the response.

The [coefficient of determination](/page/Coefficient_of_determination), $R^2$, measures the [goodness of fit](/page/Goodness_of_fit) and is interpreted as the proportion of the total variance in $Y$ explained by the model: $$R^2 = 1 - \frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2} = \frac{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2}.$$
Values of $R^2$ closer to 1 indicate a stronger linear relationship.[](https://online.stat.psu.edu/stat462/node/95/)

### Example Computation

Consider a dataset on the effects of LSD concentration ($X$) on math performance scores ($Y$), with $n=7$ observations: (1.17, 78.93), (2.97, 58.20), (3.26, 67.47), (4.69, 37.47), (5.83, 45.65), (6.00, 32.92), (6.41, 29.97). The sample means are $\bar{X} \approx 4.333$ and $\bar{Y} \approx 50.087$. Using the OLS formulas, the slope is $\hat{\beta}_1 \approx -9.009$ and the intercept is $\hat{\beta}_0 \approx 89.123$, yielding the fitted line $\hat{Y} = 89.123 - 9.009X$. The residual standard error is $s \approx 7.129$, and $\text{se}(\hat{\beta}_1) \approx 1.500$. The t-statistic for testing $\beta_1 = 0$ is $t \approx -6.01$ (df = 5, p < 0.001), indicating a significant negative relationship. Finally, $R^2 \approx 0.878$, meaning the model explains about 87.8% of the variance in scores.[](https://users.stat.ufl.edu/~winner/sta6208/notes1.pdf)

### Multiple Regression with Real Data

To illustrate the application of ordinary least squares (OLS) in multiple regression, consider Francis Galton's classic 1885 dataset on family heights, which records the heights of 934 adult children from 205 English families, including separate measurements for fathers, mothers, and children (with heights in inches). This dataset allows modeling child height as a function of both parental heights, accounting for potential gender differences in inheritance patterns. A common approach separates analyses by child gender to capture distinct effects, using OLS to estimate the parameters.

The [design matrix](/page/Design_matrix) $\mathbf{X}$ is constructed as an $n \times (p+1)$ matrix, where $n$ is the number of observations (e.g., 481 for sons), the first column is a vector of [ones](/page/The_Ones) for the intercept, and subsequent columns contain the predictors: father's height and mother's height. The response vector $\mathbf{y}$ contains child heights. The OLS estimator $\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$ is computed, typically via statistical software like [R](/page/R) or Python's statsmodels library for [numerical stability](/page/Numerical_stability), since explicitly inverting $\mathbf{X}^T \mathbf{X}$ can be unreliable when the matrix is ill-conditioned. For sons, the fitted model is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{father's height} + \hat{\beta}_2 \cdot \text{mother's height}$, with least-squares estimates $\hat{\beta}_1 = 0.36$ and $\hat{\beta}_2 = 0.25$, indicating that a one-inch increase in father's [height](/page/Height) predicts a 0.36-inch increase in son's height, holding mother's height constant, while the mother's effect is smaller but positive. For daughters, the coefficients are $\hat{\beta}_1 \approx 0.31$ for the [father](/page/Father) and $\hat{\beta}_2 \approx 0.28$ for the [mother](/page/Mother), showing comparable parental influences.[](https://arxiv.org/pdf/1508.02942) The multiple $R^2$ is approximately 0.45 for sons and 0.42 for daughters, explaining 42-45% of the variance in child [height](/page/Height). These results highlight how OLS isolates partial effects in multivariate settings, revealing nuanced [heritability](/page/Heritability) patterns originally noted by Galton.
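In practice such a regression is usually fit through a formula interface rather than by hand. The sketch below, assuming a hypothetical pandas DataFrame named `heights` with columns `child`, `father`, and `mother` (the values shown are illustrative, not Galton's actual records), shows the general pattern with statsmodels:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per adult child; column names are illustrative,
# not the exact variables of any published version of Galton's dataset.
heights = pd.DataFrame({
    "child":  [70.2, 68.5, 65.0, 64.0, 71.0, 66.5],
    "father": [72.0, 69.0, 68.0, 66.5, 73.5, 67.0],
    "mother": [65.0, 63.5, 62.0, 64.0, 66.0, 63.0],
})

# Child height regressed on both parents' heights; smf.ols builds the design
# matrix (including an intercept) from the formula and solves the normal equations.
model = smf.ols("child ~ father + mother", data=heights).fit()
print(model.params)    # intercept and the two partial slopes
print(model.rsquared)  # proportion of variance explained
```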
Model diagnostics include plotting residuals (observed minus fitted values) against fitted values to assess linearity and homoscedasticity; in Galton's data, residual plots show no strong patterns, supporting the assumptions, though some heteroscedasticity appears at the height extremes. Extended models that incorporate gender as a dummy variable (e.g., adding a binary indicator for male children) raise the fit to $R^2 = 0.66$ by accounting for the average sex-based height difference of about 5 inches.

For a modern multivariate example, the [Boston](/page/Boston) Housing dataset (506 observations from 1970 U.S. [Census](/page/Census) tracts) models [median](/page/Median) housing value (MV, in thousands of dollars) against 13 predictors, including structural factors like average rooms per dwelling (RM), socioeconomic indicators like lower-status population proportion (LSTAT), and environmental variables like nitrogen oxide concentration (NOX). The [design matrix](/page/Design_matrix) $\mathbf{X}$ includes an intercept column plus these predictors, and $\hat{\beta}$ is again estimated via OLS software. A semilog specification, $\log(\text{MV}) = \beta_0 + \sum \beta_k x_k$, yields an $R^2 = 0.81$, indicating strong explanatory power.[](https://www.journals.elsevier.com/journal-of-environmental-economics-and-management) Key coefficients from the hedonic model include $\hat{\beta}_{\text{RM}} \approx 0.11$ (a one-room increase raises log value by 0.11, or about 12% at the [mean](/page/Mean) MV), $\hat{\beta}_{\text{LSTAT}} = -0.015$ (a one-percentage-point increase in the low-status [population](/page/Population) share lowers log value by about 1.5%), and $\hat{\beta}_{\text{NOX}^2} = -0.0064$ (a quadratic term capturing nonlinear [air pollution](/page/Air_pollution) effects, reducing value by roughly \$1,613 per pphm increase in NOX at the means). These interpretations reveal trade-offs, such as how [accessibility](/page/Accessibility) (e.g., distance to [employment](/page/Employment), DIS) positively affects values while [crime](/page/Crime) (CRIM) and taxes (TAX) negatively do. Residual plots versus fitted values confirm approximate linearity, with the $R^2 = 0.81$ underscoring OLS's utility in policy-relevant hedonic pricing, though multicollinearity among predictors like industrial proportion (INDUS) and NOX warrants caution in [inference](/page/Inference).
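Several of the diagnostics discussed above (VIF, the Breusch-Pagan test, and the Durbin-Watson statistic) are available as statsmodels utilities. The following sketch applies them to simulated data with deliberately collinear regressors; all names and values are illustrative:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)
n = 300
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)   # moderately collinear with x1
X = sm.add_constant(np.column_stack([x1, x2]))
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(size=n)

res = sm.OLS(y, X).fit()

# Variance inflation factors for the two regressors (columns 1 and 2 of X).
vif = [variance_inflation_factor(X, j) for j in (1, 2)]

# Breusch-Pagan LM test of the null of homoskedasticity.
bp_lm, bp_pvalue, _, _ = het_breuschpagan(res.resid, X)

# Durbin-Watson statistic for first-order autocorrelation (values near 2 are benign).
dw = durbin_watson(res.resid)

print(vif, bp_lm, bp_pvalue, dw)
```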
