Omitted-variable bias
from Wikipedia

In statistics, omitted-variable bias (OVB) occurs when a statistical model leaves out one or more relevant variables. The bias results in the model attributing the effect of the missing variables to those that were included.

More specifically, OVB is the bias that appears in the estimates of parameters in a regression analysis, when the assumed specification is incorrect in that it omits an independent variable that is a determinant of the dependent variable and correlated with one or more of the included independent variables.

In linear regression

Intuition

Suppose the true cause-and-effect relationship is given by:

y = a + bx + cz + u

with parameters a, b, c, dependent variable y, independent variables x and z, and error term u. We wish to know the effect of x itself upon y (that is, we wish to obtain an estimate of b).

Two conditions must hold true for omitted-variable bias to exist in linear regression:

  • the omitted variable must be a determinant of the dependent variable (i.e., its true regression coefficient must not be zero); and
  • the omitted variable must be correlated with an independent variable specified in the regression (i.e., cov(z,x) must not equal zero).

Suppose we omit z from the regression, and suppose the relation between x and z is given by

z = d + fx + e

with parameters d, f and error term e. Substituting the second equation into the first gives

y = (a + cd) + (b + cf)x + (u + ce).

If a regression of y is conducted upon x only, this last equation is what is estimated, and the regression coefficient on x is actually an estimate of (b + cf), giving not simply an estimate of the desired direct effect of x upon y (which is b), but rather of its sum with the indirect effect (the effect f of x on z times the effect c of z on y). Thus by omitting the variable z from the regression, we have estimated the total derivative of y with respect to x rather than its partial derivative with respect to x. These differ if both c and f are non-zero.

The direction and extent of the bias are both contained in cf, since the effect sought is b but the regression estimates b+cf. The extent of the bias is the absolute value of cf, and the direction of bias is upward (toward a more positive or less negative value) if cf > 0 (if the direction of correlation between y and z is the same as that between x and z), and it is downward otherwise.
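This decomposition is easy to check by simulation. The sketch below, in Python with made-up parameter values (b = 2, c = 3, f = 0.5, chosen only for illustration), generates data from the two equations above and confirms that the short regression's slope estimates b + cf while the full regression recovers b.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Illustrative parameters (hypothetical values, chosen for the demo)
a, b, c = 1.0, 2.0, 3.0   # y = a + b*x + c*z + u
d, f = 0.5, 0.5           # z = d + f*x + e

x = rng.normal(size=n)
z = d + f * x + rng.normal(size=n)       # z correlated with x
y = a + b * x + c * z + rng.normal(size=n)

# Long regression: y on x and z (recovers b)
X_long = np.column_stack([np.ones(n), x, z])
b_long = np.linalg.lstsq(X_long, y, rcond=None)[0][1]

# Short regression: y on x only (recovers b + c*f)
X_short = np.column_stack([np.ones(n), x])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0][1]

print(f"long-regression slope on x:  {b_long:.3f}  (true b = {b})")
print(f"short-regression slope on x: {b_short:.3f}  (b + c*f = {b + c * f})")
```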

Detailed analysis

As an example, consider a linear model of the form

y_i = x_i \beta + z_i \delta + u_i, \qquad i = 1, \ldots, n

where

  • xi is a 1 × p row vector of values of p independent variables observed at time i or for the i th study participant;
  • β is a p × 1 column vector of unobservable parameters (the response coefficients of the dependent variable to each of the p independent variables in xi) to be estimated;
  • zi is a scalar and is the value of another independent variable that is observed at time i or for the i th study participant;
  • δ is a scalar and is an unobservable parameter (the response coefficient of the dependent variable to zi) to be estimated;
  • ui is the unobservable error term occurring at time i or for the i th study participant; it is an unobserved realization of a random variable having expected value 0 (conditionally on xi and zi);
  • yi is the observation of the dependent variable at time i or for the i th study participant.

We collect the observations of all variables subscripted i = 1, ..., n, and stack them one below another, to obtain the matrix X and the vectors Y, Z, and U:

X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \qquad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix},

and

Z = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix}, \qquad U = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}.

If the independent variable z is omitted from the regression, then the estimated values of the response parameters of the other independent variables will be given by the usual least squares calculation,

\hat{\beta} = (X'X)^{-1} X'Y

(where the "prime" notation means the transpose of a matrix and the −1 superscript denotes matrix inversion).

Substituting for Y based on the assumed linear model,

\hat{\beta} = (X'X)^{-1} X'(X\beta + Z\delta + U) = \beta + (X'X)^{-1} X'Z\delta + (X'X)^{-1} X'U.

On taking expectations, the contribution of the final term is zero; this follows from the assumption that U is uncorrelated with the regressors X. On simplifying the remaining terms:

E[\hat{\beta} \mid X, Z] = \beta + (X'X)^{-1} X'Z\delta.

The second term after the equal sign is the omitted-variable bias in this case, which is non-zero if the omitted variable z is correlated with any of the included variables in the matrix X (that is, if X'Z does not equal a vector of zeroes). Note that the bias is equal to the portion of z_i "explained" by x_i, weighted by δ.
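A minimal numerical check of this matrix expression is sketched below, with arbitrary made-up values for β and δ: the coefficients from the short regression land near β plus the predicted bias term (X'X)⁻¹X'Zδ.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5_000, 2
beta = np.array([1.0, -2.0])   # hypothetical true coefficients on X
delta = 1.5                    # hypothetical coefficient on the omitted z

X = rng.normal(size=(n, p))
Z = 0.8 * X[:, 0] + rng.normal(size=n)            # z correlated with the first column of X
Y = X @ beta + delta * Z + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)      # short regression, z omitted
bias = np.linalg.solve(X.T @ X, X.T @ Z) * delta  # (X'X)^{-1} X'Z delta

print("short-regression estimate:", beta_hat)
print("beta + predicted bias:    ", beta + bias)
```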

Effect in ordinary least squares

The Gauss–Markov theorem states that regression models which fulfill the classical linear regression model assumptions provide the best (most efficient) linear unbiased estimators. In ordinary least squares, the relevant assumption of the classical linear regression model is that the error term is uncorrelated with the regressors.

The presence of omitted-variable bias violates this particular assumption. The violation causes the OLS estimator to be biased and inconsistent. The direction of the bias depends on the sign of the omitted variable's coefficient as well as the covariance between the regressors and the omitted variable. A positive covariance of the omitted variable with both a regressor and the dependent variable will lead the OLS estimate of the included regressor's coefficient to be greater than the true value of that coefficient. This effect can be seen by taking the expectation of the parameter, as shown in the previous section.

from Grokipedia
Omitted-variable bias (OVB) is a fundamental issue in statistics and econometrics where the exclusion of one or more relevant explanatory variables from a regression model leads to biased and inconsistent estimates of the coefficients for the included variables. This bias arises when the omitted variable is correlated with at least one included independent variable and directly influences the dependent variable, violating the assumption of exogeneity in the model. Formally, in a simple linear regression context, the bias in the estimated coefficient \hat{\beta}_1 for an included regressor x due to omitting a variable w is given by \beta_2 \cdot \frac{\text{Cov}(x, w)}{\text{Var}(x)}, where \beta_2 is the true effect of w on the dependent variable y, assuming \beta_2 \neq 0 and \text{Cov}(x, w) \neq 0. The consequences of OVB are significant, as it can inflate, deflate, or even reverse the sign of estimated effects, leading to erroneous conclusions about causal relationships. For instance, in studies examining the impact of screen time on children's attentional problems, omitting factors like family environment—which influences both screen time and attention—can bias the estimated effect upward. In more complex scenarios, adjusting for observed confounders can amplify the bias from omitted variables through mechanisms like increased imbalance in the confounder distribution or the cancellation of offsetting biases from multiple sources. OVB is particularly prevalent in observational data analyses across fields such as economics, epidemiology, and the social sciences, where unmeasured confounders like socioeconomic status or genetic factors are common. To mitigate OVB, researchers employ strategies including the inclusion of proxy variables, instrumental variable methods, or sensitivity analyses to assess the robustness of findings to potential omissions. Despite these remedies, detecting OVB remains challenging without randomized controlled trials or natural experiments, underscoring its status as a primary threat to valid causal estimation in non-experimental research.

Fundamentals

Definition

Omitted-variable bias (OVB) is a form of model misspecification in statistical analysis where one or more relevant explanatory variables are excluded from the model, resulting in biased and inconsistent estimates of the coefficients for the included variables. This bias arises because the omitted variables influence the relationship between the dependent variable and the included independent variables, leading to systematic errors in parameter estimation. A variable is considered relevant—and thus its omission problematic—if it is correlated with the dependent variable and with at least one of the included independent variables. OVB represents a specific type of endogeneity in regression models, where the explanatory variables are correlated with the error term due to the excluded factors, but it differs from other sources of endogeneity such as simultaneity (reverse causality) or measurement error in variables. The term and its implications emerged in the mid-20th-century econometrics literature, with conceptual roots in early critiques of econometric practice, including Trygve Haavelmo's foundational 1944 discussion of omitted factors in probabilistic economic modeling. While applicable across various statistical frameworks, OVB is most prominently analyzed in the context of linear regression models.

Causes

Omitted-variable bias primarily arises from the exclusion of a relevant explanatory variable Z in a regression model, where Z is correlated with an included regressor X such that \text{Cov}(X, Z) \neq 0, and Z genuinely influences the outcome variable Y with a nonzero true coefficient \gamma. This correlation causes the effect of Z on Y to be incorrectly attributed to X, distorting the estimated coefficient on X. Such omissions are common in empirical analyses where not all potential influences can be anticipated or measured.

Secondary causes of omitted-variable bias stem from broader model misspecification, including theoretical oversight where researchers neglect variables identified in prior research or by domain expertise as affecting the outcome. Data limitations, such as the unavailability of reliable measurements for key variables due to collection constraints or privacy issues, also contribute by forcing exclusions. Additionally, reliance on proxy variables that incompletely represent the underlying factors can leave residual effects unaccounted for, effectively creating an omission.

In causal inference, omitted variables frequently serve as confounders, simultaneously influencing both the explanatory variable X (treatment) and the outcome Y, which breaches the exogeneity assumption that regressors are uncorrelated with the error term. This leads to spurious associations, undermining the validity of causal claims derived from the model. Bias does not occur under specific conditions: if \text{Cov}(X, Z) = 0, the omitted Z exerts no influence through X; or if \gamma = 0, Z has no direct effect on Y, rendering its omission harmless. Both no-bias conditions are illustrated in the sketch below.
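The following Python sketch (with illustrative, made-up parameters) simulates the two no-bias scenarios: in one, Z matters for Y but is independent of X; in the other, Z is correlated with X but has a zero coefficient. In both, the short regression's slope stays near the true β₁.

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta1 = 200_000, 1.0

def short_slope(x, y):
    """Slope from regressing y on x (with intercept), z omitted."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

x = rng.normal(size=n)

# Case 1: Cov(X, Z) = 0 -- z is relevant (gamma != 0) but independent of x
z_indep = rng.normal(size=n)
y1 = beta1 * x + 2.0 * z_indep + rng.normal(size=n)

# Case 2: gamma = 0 -- z is correlated with x but irrelevant to y
z_corr = 0.8 * x + rng.normal(size=n)
y2 = beta1 * x + 0.0 * z_corr + rng.normal(size=n)

print(f"Cov(X,Z)=0 case: slope = {short_slope(x, y1):.3f} (true {beta1})")
print(f"gamma=0 case:    slope = {short_slope(x, y2):.3f} (true {beta1})")
```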

Linear Regression Context

Intuition

Omitted-variable bias arises when a model incorrectly attributes the effects of an unmeasured factor to the included variables, much like attributing a child's growth solely to diet while ignoring genetics, which would overstate the impact of diet alone. In this analogy, genetics strongly influences growth but correlates with dietary quality through family background, leading the diet coefficient to capture combined influences and exaggerate its true effect.

Consider a classic example in wage determination: suppose a regression models wages as a function of education but omits innate ability, which positively affects both educational attainment and earning potential. The education coefficient then absorbs part of ability's effect, biasing it upward to reflect not just schooling's direct return but also the higher wages of more able individuals who pursue more education. This illustrates how the omitted factor distorts the estimated causal relationship.

The direction of omitted-variable bias hinges on the correlations involved; for instance, if the omitted variable positively correlates with an included explanatory variable and positively affects the outcome, the bias typically pushes the coefficient away from zero in the positive direction. A common misconception is that any omitted variable causes bias, but only those correlated with included variables and influencing the dependent variable lead to systematic distortion; irrelevant omissions produce no bias.

Mathematical Formulation

In the linear regression framework, consider the true population model where the outcome variable Y depends on an included regressor X and an omitted variable Z:

Y = \beta_0 + \beta_1 X + \gamma Z + \varepsilon,

with E(\varepsilon \mid X, Z) = 0. When Z is omitted, the misspecified model becomes

Y = \beta_0^* + \beta_1^* X + u,

where the composite error term is u = \gamma Z + \varepsilon. Under standard assumptions for ordinary least squares (OLS) estimation—linearity in parameters, random sampling, zero conditional mean of the error given the regressors, and no perfect collinearity—the OLS estimator \hat{\beta}_1^* from the misspecified model is inconsistent for \beta_1 if \gamma \neq 0 and X is correlated with Z. These assumptions include strict exogeneity, E(\varepsilon \mid X, Z) = 0, and homoskedasticity for valid variance estimation, though the bias in \hat{\beta}_1^* persists asymptotically even without homoskedasticity, leading to inconsistency.

To derive the bias, substitute the true model into the OLS estimator for the slope in the simple regression of Y on X:

\hat{\beta}_1^* = \frac{\text{Cov}(Y, X)}{\text{Var}(X)}.

Inserting Y = \beta_0 + \beta_1 X + \gamma Z + \varepsilon yields

\text{Cov}(Y, X) = \beta_1 \text{Var}(X) + \gamma \text{Cov}(X, Z) + \text{Cov}(X, \varepsilon).

Under the zero conditional mean assumption, \text{Cov}(X, \varepsilon) = 0, so

\hat{\beta}_1^* = \beta_1 + \gamma \frac{\text{Cov}(X, Z)}{\text{Var}(X)}.

Thus, the expected bias is

E(\hat{\beta}_1^* - \beta_1) = \gamma \frac{\text{Cov}(X, Z)}{\text{Var}(X)}.

This expression shows that the bias equals \gamma times the population regression coefficient of Z on X, denoted \delta = \frac{\text{Cov}(X, Z)}{\text{Var}(X)}, from the auxiliary regression Z = \delta_0 + \delta X + v. An alternative derivation uses direct projection or the Frisch–Waugh–Lovell theorem, which decomposes the multiple regression coefficient on X as the simple regression of Y on the residuals of X after projecting out Z. Omitting Z fails to partial out its influence, leaving the bias term \gamma \delta from the correlation between X and the composite error u.
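The derivation can be verified numerically. The sketch below (hypothetical parameter values, chosen only for the demo) estimates δ from the auxiliary regression of Z on X and confirms that the short-regression slope is approximately β₁ + γδ.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000
beta0, beta1, gamma = 0.5, 1.0, 2.0   # hypothetical true parameters

x = rng.normal(size=n)
z = 0.6 * x + rng.normal(size=n)      # true delta = 0.6
y = beta0 + beta1 * x + gamma * z + rng.normal(size=n)

def ols_slope(x, y):
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

delta_hat = ols_slope(x, z)           # auxiliary regression of Z on X
beta1_star = ols_slope(x, y)          # short regression, z omitted

print(f"short-regression slope: {beta1_star:.3f}")
print(f"beta1 + gamma*delta:    {beta1 + gamma * delta_hat:.3f}")
```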

Consequences

Bias in Estimators

Omitted-variable bias (OVB) renders estimators inconsistent, as the parameter estimates fail to converge to their true values even as the sample size grows indefinitely. This occurs because the omitted variable introduces a systematic correlation between the included regressors and the error term, preventing the estimator from yielding unbiased results in the limit. The asymptotic properties of the biased estimator highlight this persistence, where the probability limit of the estimate is given by

\operatorname{plim} \hat{\beta}_1^* = \beta_1 + \gamma \delta,

with \beta_1 as the true coefficient, \gamma as the coefficient on the omitted variable, and \delta as the auxiliary regression coefficient measuring the association between the included regressor and the omitted variable. This formula demonstrates that the bias does not diminish with larger samples, leading to unreliable inference in misspecified models.

The direction of the bias depends on the signs of \gamma and \delta: it is positive (upward) if these signs match, inflating the estimated effect, or negative (downward) if they oppose, potentially attenuating or reversing the true relationship. The magnitude of the bias is influenced by the strength of the correlations involved; stronger covariances between the included and omitted variables amplify the distortion, making even modest omissions problematic in highly correlated data structures. OVB contributes to endogeneity by correlating the regressors with the error term through the omitted factor, but it remains distinct from reverse causality, where the dependent variable directly influences the independent variable.
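Inconsistency is easy to see by letting the sample grow. In the sketch below (made-up parameters γ = 2, δ = 0.5), the short-regression estimate stays near β₁ + γδ = 2 at every sample size rather than converging to the true β₁ = 1.

```python
import numpy as np

rng = np.random.default_rng(4)
beta1, gamma, delta = 1.0, 2.0, 0.5   # hypothetical parameters

for n in [100, 10_000, 1_000_000]:
    x = rng.normal(size=n)
    z = delta * x + rng.normal(size=n)
    y = beta1 * x + gamma * z + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    slope = np.linalg.lstsq(X, y, rcond=None)[0][1]
    print(f"n = {n:>9,}: slope = {slope:.3f} (plim = {beta1 + gamma * delta})")
```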

Ordinary Least Squares Effects

In ordinary least squares (OLS) regression, the estimator is given by \hat{\beta}_{OLS} = (X'X)^{-1} X'Y, where Y is the dependent variable vector, X includes the regressors, and the error term is assumed to satisfy the strict exogeneity condition E[u \mid X] = 0. When an omitted variable Z is present, the true model becomes Y = X\beta + Z\gamma + v, with v uncorrelated with X and Z, resulting in a composite error u = Z\gamma + v. This leads to \text{Cov}(X, u) \neq 0 if \text{Cov}(X, Z) \neq 0 and \gamma \neq 0, violating the exogeneity assumption and rendering the OLS estimator biased and inconsistent. The bias in the OLS coefficients manifests as E[\hat{\beta}] = \beta + \frac{\text{Cov}(X, Z)}{\text{Var}(X)} \gamma, where the direction and magnitude depend on the correlations involved, potentially over- or underestimating the true effects and even reversing their signs.

This biased estimation propagates to other diagnostics; for instance, the reported R^2 becomes misleading because it measures fit against a misspecified model, over- or understating explanatory power without capturing the full variation due to the omitted factor. Beyond coefficient bias, OVB distorts standard errors, as they are computed under the invalid assumption of correctly specified errors, often resulting in inflated or deflated values that invalidate t-tests and confidence intervals. This failure arises because the t-statistic, t = \frac{\hat{\beta} - \beta_0}{\text{SE}(\hat{\beta})}, relies on a distorted numerator and denominator, leading to incorrect p-values and misinterpretations.

OVB also induces inefficiency in OLS estimation, as omitting a relevant variable increases the residual variance—absorbing unexplained variation into the error term—which in turn elevates the variance of the estimates, even in cases where bias might be absent (though bias is the primary issue here). In finite samples, this effect emerges immediately rather than asymptotically, exacerbating variance and compounding inference problems from the outset of estimation.
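These distortions can be inspected side by side. The sketch below (using statsmodels on simulated data with invented parameter values) fits the short and long models on the same sample and prints the coefficient on x, its standard error, and R² for each; the short model reports a confidently precise but wrong coefficient.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 5_000
x = rng.normal(size=n)
z = 0.7 * x + rng.normal(size=n)             # omitted confounder
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # true coefficient on x is 1.0

short = sm.OLS(y, sm.add_constant(x)).fit()
long_ = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()

for name, res in [("short (z omitted)", short), ("long (z included)", long_)]:
    print(f"{name}: coef on x = {res.params[1]:.3f}, "
          f"SE = {res.bse[1]:.3f}, R^2 = {res.rsquared:.3f}")
```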

Mitigation

Detection Methods

Detecting omitted-variable bias (OVB) is essential to ensure the validity of regression estimates, as undetected omissions can lead to misleading inferences about causal relationships. Traditional econometric tests and qualitative assessments provide tools to identify potential OVB by checking for model misspecification or endogeneity arising from excluded confounders. These methods focus on diagnostic procedures rather than direct measurement of the bias, often requiring assumptions about the data-generating process or auxiliary information.

The Ramsey Regression Equation Specification Error Test (RESET), introduced by Ramsey in 1969, serves as a general diagnostic for functional form misspecification, including potential OVB. The test involves augmenting the original regression model with powers of the fitted values from an ordinary least squares (OLS) estimation and then performing an F-test to assess whether these additional terms are jointly significant (a minimal implementation is sketched at the end of this subsection). A rejection of the null hypothesis suggests that the model may suffer from omitted nonlinear terms or variables, indicating possible OVB, though it does not isolate the exact source. This test is widely applied due to its simplicity and applicability to linear models without requiring knowledge of specific omitted variables.

The Hausman specification test, developed by Hausman in 1978, detects endogeneity that may stem from OVB by comparing OLS estimates, which are consistent under correct specification but inefficient if endogeneity exists, with instrumental variable (IV) estimates, which are consistent under endogeneity but inefficient otherwise. The test statistic, based on the difference between these estimators, follows a chi-squared distribution under the null of no endogeneity (i.e., no OVB or other violations). A significant result signals the need for robustness checks, as it implies that OLS coefficients are biased due to correlation between regressors and the error term, potentially from omitted variables. This approach is particularly useful when valid instruments are available to proxy for the omitted factors.

Correlation checks on residuals offer a straightforward graphical and statistical diagnostic for OVB. After fitting the OLS model, one examines the residuals for systematic patterns, such as trends or correlations with observable proxies for potential omitted variables; for instance, if residuals correlate significantly with a variable known to influence the outcome but excluded from the model, this flags OVB. Scatterplots of residuals against included regressors or suspected confounders can reveal nonlinearity or trends indicative of misspecification. These checks rely on residual analysis principles and are recommended as preliminary steps in model validation.

Theoretical assessment using domain knowledge is a foundational, non-statistical method to preemptively identify OVB risks. Researchers draw on substantive theory or prior findings to hypothesize potential omitted confounders that correlate with both independent and dependent variables, such as socioeconomic factors in educational outcome models. This qualitative evaluation guides variable inclusion and sensitivity analyses, ensuring models align with causal mechanisms in the field. It is especially valuable in contexts where data limitations prevent formal testing.
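As a sketch of the RESET idea described above (one common manual construction, not the only implementation), the following code augments a short regression with squared and cubed fitted values and runs an F-test on the added terms; the data are simulated with an omitted variable tied nonlinearly to x, so the test tends to reject.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 2_000
x = rng.normal(size=n)
z = x**2 + rng.normal(size=n)                # omitted variable, nonlinearly tied to x
y = 1.0 * x + 2.0 * z + rng.normal(size=n)

# Restricted model: y on x only (z omitted)
restricted = sm.OLS(y, sm.add_constant(x)).fit()

# RESET-style augmentation: add powers of the fitted values
yhat = restricted.fittedvalues
X_aug = sm.add_constant(np.column_stack([x, yhat**2, yhat**3]))
augmented = sm.OLS(y, X_aug).fit()

f_stat, p_value, _ = augmented.compare_f_test(restricted)
print(f"RESET F = {f_stat:.2f}, p = {p_value:.4f}")  # small p flags misspecification
```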
Sensitivity analyses can be formalized using frameworks that quantify how strong unmeasured confounders must be to overturn key findings, such as the approach of Cinelli and Hazlett (2020), which extends the OVB formula to assess robustness without assuming specific forms for omitted variables and implements tools like the sensemakr package for practical application.

Recent advancements incorporate machine learning techniques, such as the double Lasso (or post-double-selection Lasso), to detect OVB in high-dimensional settings. Proposed by Belloni, Chernozhukov, and Hansen in 2014, this method applies regularization twice—once to select controls for the outcome and once for the treatment—to identify relevant variables while controlling for omissions that could bias treatment effect estimates. If the selected variables substantially alter the coefficient of interest upon inclusion, it signals potential OVB from prior exclusions. This approach addresses traditional methods' limitations in large datasets by automating variable selection and flagging omissions through inference stability checks.

Remedial Approaches

The primary remedy for omitted-variable bias (OVB) is to include the omitted variable in the model if it is observable and measurable, thereby ensuring that the regression specification accounts for all relevant confounders correlated with both the included explanatory variables and the error term. This approach restores the exogeneity assumption under ordinary least squares (OLS), yielding unbiased and consistent estimates of the parameters of interest.

When the omitted variable is unobservable, instrumental variables (IV) estimation provides a key alternative by employing an instrument W that is correlated with the omitted variable Z (or the endogenous regressor influenced by Z) but uncorrelated with the error term \epsilon. The method, often implemented via two-stage least squares (2SLS), isolates exogenous variation in the endogenous variable through the instrument's first-stage relationship, producing consistent estimates of the causal effect, such as the local average treatment effect (LATE) for subgroups affected by the instrument (a minimal 2SLS sketch appears at the end of this subsection). Valid instruments require relevance (strong correlation with the endogenous variable) and exogeneity (no direct effect on the outcome except through the endogenous variable, per the exclusion restriction).

In panel data settings, fixed effects models address OVB from time-invariant unobserved heterogeneity by differencing out entity-specific effects, such as individual ability or firm characteristics, that remain constant over time. This within-transformation, equivalent to including entity dummies, eliminates bias from time-invariant confounders correlated with the regressors, assuming the omitted variables do not vary over time. However, fixed effects cannot mitigate bias from time-varying omitted variables, necessitating complementary approaches like IV.

Proxy variables offer a partial solution when a direct measure of the omitted variable is unavailable, by incorporating an imperfect but correlated observable surrogate that approximates its influence and attenuates the bias in OLS estimates. For instance, a measurable indicator like test scores might proxy for unobservable ability in wage regressions, reducing but not fully eliminating OVB if the proxy introduces measurement error.

Other quasi-experimental methods, such as difference-in-differences (DiD), further isolate causal effects by comparing changes over time between treated and control groups, thereby netting out time-invariant omitted variables and common time trends under the parallel trends assumption. Similarly, regression discontinuity designs exploit sharp cutoffs in treatment assignment based on a running variable to estimate local effects near the threshold, where continuity assumptions rule out jumps from omitted confounders.

These remedial approaches involve trade-offs: over-inclusion of variables to guard against OVB risks multicollinearity, which inflates standard errors and reduces coefficient precision without biasing estimates, while under-inclusion perpetuates the bias. Researchers must balance model parsimony with comprehensiveness, often guided by detection methods to assess specification adequacy.
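A minimal two-stage least squares sketch in plain numpy (with an invented instrument w that shifts x but has no direct effect on y, and made-up parameter values) illustrates the IV remedy: the first stage projects x onto w, and the second stage regresses y on the fitted values.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
beta1 = 1.0                        # hypothetical true causal effect of x on y

z = rng.normal(size=n)             # unobserved confounder
w = rng.normal(size=n)             # instrument: relevant to x, excluded from y
x = 0.8 * w + 1.0 * z + rng.normal(size=n)
y = beta1 * x + 2.0 * z + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
# Biased OLS: x is endogenous because z is omitted
b_ols = ols(np.column_stack([ones, x]), y)[1]

# 2SLS: first stage regresses x on w, second stage regresses y on fitted x
x_hat = np.column_stack([ones, w]) @ ols(np.column_stack([ones, w]), x)
b_2sls = ols(np.column_stack([ones, x_hat]), y)[1]

print(f"OLS estimate:  {b_ols:.3f} (biased upward)")
print(f"2SLS estimate: {b_2sls:.3f} (true beta1 = {beta1})")
```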
