Endogeneity (econometrics)
from Wikipedia

In econometrics, endogeneity broadly refers to situations in which an explanatory variable is correlated with the error term.[1]

In simplest terms, endogeneity means that a factor or cause one uses to explain something as an outcome is also being influenced by that same thing. For example, education can affect income, but income can also affect how much education someone gets. When this happens, one's analysis might wrongly estimate cause and effect. The thing one thinks is causing change is also being influenced by the outcome, making the results unreliable.

The concept originates from simultaneous equations models, in which one distinguishes variables whose values are determined within the economic model (endogenous) from those that are predetermined (exogenous).[a][2]

Ignoring simultaneity in estimation leads to biased and inconsistent estimators, as it violates the exogeneity condition of the Gauss–Markov theorem. This issue is often overlooked in non-experimental research, which limits the validity of causal inference and the ability to draw reliable policy recommendations.[3]

Common solutions to address endogeneity include the use of instrumental variable techniques, which provide consistent estimators by introducing variables that are correlated with the endogenous explanatory variable but uncorrelated with the error term.
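
The instrumental variable idea can be illustrated with a small simulation. This is only a sketch under assumed data: the variable names, the coefficients 0.8 and 0.6, and the single-instrument ratio estimator Cov(z, y) / Cov(z, x) are illustrative choices, not taken from the sources cited here.

```python
import numpy as np

def simulate_iv(n=200_000, beta=2.0, seed=0):
    """Compare OLS and a simple IV estimator on simulated data (illustrative only)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                      # instrument: moves x, unrelated to the error
    e = rng.normal(size=n)                      # structural error term
    x = 0.8 * z + 0.6 * e + rng.normal(size=n)  # x is endogenous because it depends on e
    y = beta * x + e
    b_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # inconsistent when Cov(x, e) != 0
    b_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]   # uses only the variation in x driven by z
    return b_ols, b_iv

if __name__ == "__main__":
    b_ols, b_iv = simulate_iv()
    print(f"OLS slope: {b_ols:.3f}, IV slope: {b_iv:.3f}, true slope: 2.0")
```

In this design the OLS slope settles above the true value because x and the error move together, while the IV slope stays close to it.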

Besides simultaneity, correlation between explanatory variables and the error term can arise when an unobserved or omitted variable is confounding both independent and dependent variables, or when independent variables are measured with error.[4]

Exogeneity versus endogeneity

In a stochastic model, the notions of ordinary exogeneity, sequential exogeneity, and strong/strict exogeneity can be defined. Exogeneity is articulated in such a way that a variable or variables is exogenous for parameter $\alpha$. Even if a variable is exogenous for parameter $\alpha$, it might be endogenous for parameter $\beta$.

When the explanatory variables are not stochastic, they are strongly exogenous for all the parameters.

If the independent variable is correlated with the error term in a regression model then the estimate of the regression coefficient in an ordinary least squares (OLS) regression is biased; however if the correlation is not contemporaneous, then the coefficient estimate may still be consistent. There are many methods of correcting the bias, including instrumental variable regression and Heckman selection correction.

Static models

The following are some common sources of endogeneity.

Omitted variable

In this case, the endogeneity comes from an uncontrolled confounding variable, a variable that is correlated with both the independent variable in the model and with the error term. (Equivalently, the omitted variable affects the independent variable and separately affects the dependent variable.)

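Assume that the "true" model to be estimated is

$$y_i = \alpha + \beta x_i + \gamma z_i + u_i$$

but $z_i$ is omitted from the regression model (perhaps because there is no way to measure it directly). Then the model that is actually estimated is

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

where $\varepsilon_i = \gamma z_i + u_i$ (thus, the $z_i$ term has been absorbed into the error term).

If the correlation of $x$ and $z$ is not 0 and $z$ separately affects $y$ (meaning $\gamma \neq 0$), then $x$ is correlated with the error term $\varepsilon$.

Here, $x$ is not exogenous for $\alpha$ and $\beta$, since, given $x$, the distribution of $y$ depends not only on $\alpha$ and $\beta$, but also on $z$ and $\gamma$.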

Measurement error

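Suppose that a perfect measure of an independent variable is impossible. That is, instead of observing $x_i^*$, what is actually observed is $x_i = x_i^* + \nu_i$, where $\nu_i$ is the measurement error or "noise". In this case, a model given by

$$y_i = \alpha + \beta x_i^* + \varepsilon_i$$

can be written in terms of observables and error terms as

$$y_i = \alpha + \beta (x_i - \nu_i) + \varepsilon_i = \alpha + \beta x_i + u_i, \qquad u_i = \varepsilon_i - \beta \nu_i.$$

Since both $x_i$ and $u_i$ depend on $\nu_i$, they are correlated, so the OLS estimate of $\beta$ will be biased toward zero (attenuation), which means downward when $\beta > 0$.

Measurement error in the dependent variable, $y_i$, does not cause endogeneity, though it does increase the variance of the error term.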

Simultaneity

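Suppose that two variables are codetermined, with each affecting the other according to the following "structural" equations:

$$y_i = \beta_1 x_i + \gamma_1 z_i + u_i$$

$$z_i = \beta_2 x_i + \gamma_2 y_i + v_i$$

Estimating either equation by itself results in endogeneity. In the case of the first structural equation, $E(z_i u_i) \neq 0$. Solving for $z_i$ while assuming that $1 - \gamma_1 \gamma_2 \neq 0$ results in

$$z_i = \frac{\beta_2 + \gamma_2 \beta_1}{1 - \gamma_1 \gamma_2} x_i + \frac{1}{1 - \gamma_1 \gamma_2} v_i + \frac{\gamma_2}{1 - \gamma_1 \gamma_2} u_i.$$

Assuming that $x_i$ and $v_i$ are uncorrelated with $u_i$,

$$E(z_i u_i) = \frac{\gamma_2}{1 - \gamma_1 \gamma_2} E(u_i u_i) \neq 0.$$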

Therefore, attempts at estimating either structural equation will be hampered by endogeneity.
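
A short simulation can make the simultaneity argument concrete. The sketch below assumes the structural equations take the form y = β1·x + γ1·z + u and z = β2·x + γ2·y + v with independent standard normal x, u and v; the parameter values and function name are hypothetical and chosen only for illustration.

```python
import numpy as np

def simulate_simultaneity(n=500_000, b1=1.0, g1=0.5, b2=0.3, g2=0.4, seed=1):
    """Generate data from two codetermined equations and check E(z*u) (illustrative only)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    u = rng.normal(size=n)
    v = rng.normal(size=n)
    d = 1.0 - g1 * g2                              # must be nonzero for a solution to exist
    z = ((b2 + g2 * b1) * x + v + g2 * u) / d      # reduced form for z
    y = b1 * x + g1 * z + u                        # first structural equation
    sample_moment = np.mean(z * u)                 # sample analogue of E(z*u)
    theory_moment = g2 / d                         # gamma_2 / (1 - gamma_1*gamma_2) * Var(u), Var(u) = 1
    # OLS on the first structural equation (no intercept: all variables have mean zero)
    coef = np.linalg.lstsq(np.column_stack([x, z]), y, rcond=None)[0]
    return sample_moment, theory_moment, coef

if __name__ == "__main__":
    sample_moment, theory_moment, coef = simulate_simultaneity()
    print(f"E(z*u): sample {sample_moment:.3f}, theory {theory_moment:.3f}")
    print(f"OLS estimate of gamma_1: {coef[1]:.3f} (true value 0.5)")
```

With these assumed parameters the sample moment matches the theoretical value, and the OLS estimate of γ1 lands well above 0.5 because z carries part of u.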

Dynamic models

The endogeneity problem is particularly relevant in the context of time series analysis of causal processes. It is common for some factors within a causal system to be dependent for their value in period t on the values of other factors in the causal system in period t − 1. Suppose that the level of pest infestation is independent of all other factors within a given period, but is influenced by the level of rainfall and fertilizer in the preceding period. In this instance it would be correct to say that infestation is exogenous within the period, but endogenous over time.

Let the model be $y = f(x, z) + u$. If the variable $x$ is sequentially exogenous for parameter $\alpha$, and $y$ does not cause $x$ in the Granger sense, then the variable $x$ is strongly/strictly exogenous for the parameter $\alpha$.

Simultaneity

Generally speaking, simultaneity occurs in the dynamic model just like in the example of static simultaneity above.

from Grokipedia
In econometrics, endogeneity arises when an explanatory variable in a regression model is correlated with the error term, violating the exogeneity assumption required for ordinary least squares (OLS) estimation to produce unbiased and consistent parameter estimates. This correlation prevents reliable causal inference, as the estimates may reflect spurious relationships rather than true effects.

The primary sources of endogeneity include omitted variables, where unobserved factors influence both the dependent and independent variables, leading to bias in the coefficients of included regressors; simultaneity, involving bidirectional causality between variables, such as when firm value and corporate policies mutually determine each other; and measurement error, where imperfect proxies for variables introduce noise that correlates with the error term, often causing attenuation bias. For instance, in studies of CEO compensation, omitted executive ability can bias estimates of the relationship between firm size and pay. These issues are particularly prevalent in fields like corporate finance and labor economics, where data limitations and complex interactions amplify the problem.

To address endogeneity, econometricians employ methods such as instrumental variables (IV) estimation, which uses exogenous instruments correlated with the endogenous regressor but uncorrelated with the error term to isolate causal effects; fixed effects models in panel data to control for unobserved heterogeneity; difference-in-differences (DiD) designs that exploit natural experiments or policy changes; and regression discontinuity designs (RDD) leveraging sharp cutoffs in treatment assignment. A classic IV example is using the gender of a CEO's first-born child as an instrument for family succession in firm performance studies, assuming it affects outcomes only through primogeniture traditions. These techniques require careful validation, such as checking instrument relevance with first-stage F-statistics exceeding 10, to ensure robust inference.

Fundamentals

Exogeneity

In econometrics, strict exogeneity is a stringent condition under which an explanatory variable $X$ is uncorrelated with the error term $\epsilon$ across all time periods, so that the conditional expectation of the error given the entire history of $X$ is zero. Formally, $X$ is strictly exogenous if $E[\epsilon_t \mid X_s \ \forall s] = 0$ for all $t$, which can be written compactly as $E[\epsilon \mid X] = 0$. This assumption rules out feedback from the error term to any realization of $X$, past, present, or future, which makes it particularly relevant in dynamic or panel data settings where temporal dependencies are present.

Weak exogeneity, in the sense used here, is a less restrictive variant that concerns only the conditional mean in the contemporaneous period: it requires $E[\epsilon_t \mid X_t] = 0$ without demanding independence from the full history or future values of $X$. This allows current errors to be correlated with future explanatory variables, for example through feedback mechanisms, yet it suffices for consistent estimation in models focused on conditional means, as in many time series applications of OLS.

Engle, Hendry, and Richard formalized the concept of weak exogeneity, together with strong and super exogeneity, in their seminal 1983 paper, which proposed definitions in terms of the joint distribution of observable variables to address ambiguities in prior econometric usage and to facilitate testing and model reduction.

Under the classical assumptions of the linear regression model, exogeneity (typically the strict form, or the weak form in some dynamic contexts) ensures that ordinary least squares (OLS) estimators are consistent, and unbiased under the strict form, because the absence of correlation between regressors and errors allows the moment conditions to hold. Without this assumption, OLS estimates suffer from bias, underscoring exogeneity's role as the baseline for valid inference in econometric analysis. Endogeneity occurs precisely when these exogeneity conditions are violated, resulting in correlation between explanatory variables and the error term.

Endogeneity

In econometrics, endogeneity arises when one or more explanatory variables in a regression model are correlated with the error term, violating the core assumption of exogeneity that requires such variables to be uncorrelated with unobserved factors influencing the dependent variable. This correlation, formally expressed as $\operatorname{Cov}(X, \epsilon) \neq 0$, implies that the explanatory variables do not originate solely from external sources but are influenced by the same underlying processes captured in the error term.

Under endogeneity, ordinary least squares (OLS) estimators fail to provide consistent estimates of the true parameters. Specifically, the probability limit of the OLS estimator $\hat{\beta}$ is given by

$$\operatorname{plim}(\hat{\beta}) = \beta + (E[X'X])^{-1} E[X'\epsilon],$$

where the second term represents the bias due to the non-zero covariance between $X$ and $\epsilon$. This deviation from the true parameter $\beta$ persists even as the sample size grows, leading to systematically incorrect inferences about causal relationships.

Endogeneity can stem from various sources, including omitted factors that jointly affect both the explanatory variables and the outcome, reverse causality where the dependent variable influences the explanatory variables, or errors in measuring the explanatory variables that propagate into the error term. A classic illustration is a wage equation in which years of education are used to explain log wages but innate ability is omitted from the specification; since ability correlates with both education choices and wages, education becomes endogenous, biasing the estimated return to education upward. The presence of endogeneity generally results in inconsistent point estimates, undermining the reliability of regression-based conclusions across empirical economic analyses, though it does not directly impair estimator efficiency in finite samples.
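
The probability limit quoted in this section follows from substituting the model into the OLS formula. The derivation below is a sketch under assumptions not spelled out in the text: a linear model $Y = X\beta + \epsilon$, a law of large numbers so that $X'X/n$ and $X'\epsilon/n$ converge to the per-observation moments written here as $E[X'X]$ and $E[X'\epsilon]$, and an invertible $E[X'X]$.

```latex
\begin{aligned}
\hat{\beta} &= (X'X)^{-1} X'Y \\
            &= (X'X)^{-1} X'(X\beta + \epsilon) \\
            &= \beta + \left(\tfrac{1}{n} X'X\right)^{-1} \left(\tfrac{1}{n} X'\epsilon\right), \\
\operatorname{plim} \hat{\beta} &= \beta + (E[X'X])^{-1} E[X'\epsilon].
\end{aligned}
```

The second term vanishes only when $E[X'\epsilon] = 0$, which is exactly the exogeneity condition.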

Causes in Static Models

Omitted Variables

Omitted variable bias arises in static models when a relevant explanatory variable is excluded from the specification, leading to correlation between the included regressors and the error term. Consider the true model $Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon$, where $Z$ is an omitted variable that affects the outcome $Y$, and $\varepsilon$ is uncorrelated with both $X$ and $Z$. If $Z$ is omitted, the estimated model becomes $Y = \beta_0 + \beta_1 X + \varepsilon^*$, where the composite error is $\varepsilon^* = \beta_2 Z + \varepsilon$. This omission induces endogeneity if $X$ and $Z$ are correlated, as $\text{Cov}(X, \varepsilon^*) = \beta_2 \text{Cov}(X, Z) \neq 0$.

The direction of the resulting bias in the estimator $\hat{\beta}_1$ depends on the signs of $\beta_2$ and $\text{Cov}(X, Z)$. In the simple case with a single included regressor, the expected value of the ordinary least squares (OLS) estimator is $E[\hat{\beta}_1] = \beta_1 + \beta_2 \frac{\text{Cov}(X, Z)}{\text{Var}(X)}$, which deviates from the true $\beta_1$ unless $\beta_2 = 0$ or $\text{Cov}(X, Z) = 0$. The bias does not shrink as the sample size increases, so the estimator is inconsistent, and it can lead to misleading inferences about the causal effect of $X$ on $Y$.

A classic example occurs in estimating the returns to education on wages, where innate ability is often omitted. Ability positively affects both wages and education, so omitting it results in $\text{Cov}(\text{education}, \text{ability}) > 0$ and $\beta_2 > 0$, yielding an upward-biased estimate of the return to education, potentially overstating returns by 20-30% or more. To recover unbiased estimates, researchers may include proxy variables that capture the omitted factor without introducing new correlations, or exploit panel data to control for time-invariant unobserved heterogeneity through fixed effects.
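
The omitted-variable bias formula can be checked numerically. The following sketch simulates data from an assumed model (intercept 2.0, slope 0.7 linking X to Z, and unit-variance normal shocks are all illustrative assumptions) and compares the short-regression slope with β1 + β2·Cov(X, Z)/Var(X).

```python
import numpy as np

def ovb_demo(n=200_000, b1=1.0, b2=0.5, seed=2):
    """Compare the slope with Z omitted against the bias formula (illustrative only)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                  # the variable that will be omitted
    x = 0.7 * z + rng.normal(size=n)        # X is correlated with Z
    eps = rng.normal(size=n)
    y = 2.0 + b1 * x + b2 * z + eps
    # Short regression: Y on X only (the slope equals Cov(X, Y) / Var(X))
    short_slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    # Value predicted by the bias formula, using sample moments
    predicted = b1 + b2 * np.cov(x, z)[0, 1] / np.var(x, ddof=1)
    # Long regression: Y on a constant, X and Z recovers b1
    long_coef = np.linalg.lstsq(np.column_stack([np.ones(n), x, z]), y, rcond=None)[0]
    return short_slope, predicted, long_coef[1]

if __name__ == "__main__":
    short_slope, predicted, long_slope = ovb_demo()
    print(f"short: {short_slope:.3f}, formula: {predicted:.3f}, long: {long_slope:.3f}")
```

Because β2 and Cov(X, Z) are both positive in this design, the short-regression slope overshoots the true value of 1.0, while the long regression stays close to it.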

Measurement Error

Measurement error in explanatory variables represents a key source of endogeneity in static econometric models: the observed variable $X^*$ becomes correlated with the model's error term, violating the exogeneity assumption required for consistent ordinary least squares (OLS) estimation. In the classical framework, the true unobserved explanatory variable $X$ relates to the observed $X^*$ as $X^* = X + u$, where $u$ is the measurement error, satisfying $E[u \mid X] = 0$ and uncorrelated with the structural error $\epsilon$ in the equation $Y = \beta X + \epsilon$. Substituting $X = X^* - u$ yields the observed regression $Y = \beta X^* + \epsilon'$, with composite error $\epsilon' = \epsilon - \beta u$. This induces endogeneity because $\text{Cov}(X^*, \epsilon') = -\beta \text{Var}(u) \neq 0$ whenever $\beta \neq 0$: the measurement error $u$ enters $X^*$ with a positive sign and the composite error with the opposite sign. Consequently, the probability limit of the OLS estimator is $\operatorname{plim} \hat{\beta} = \beta \frac{\text{Var}(X)}{\text{Var}(X) + \text{Var}(u)}$, which is smaller than $\beta$ in absolute value. The result is attenuation bias toward zero, whose severity depends on the signal-to-noise ratio $\frac{\text{Var}(X)}{\text{Var}(u)}$.

Non-classical measurement error arises when $u$ correlates with $X$ or $\epsilon$, further complicating the endogeneity and potentially reversing the bias direction. For instance, if respondents report values as optimal predictors of the true $X$ (e.g., due to recall bias in survey data), the error $u$ may correlate negatively with the true $X$, which lowers $\text{Cov}(u, X^*)$ and can weaken the attenuation or even push $\hat{\beta}$ away from zero. The general form of the inconsistency is $\operatorname{plim} \hat{\beta} = \beta (1 - \beta_{u X^*})$, where $\beta_{u X^*} = \frac{\text{Cov}(u, X^*)}{\text{Var}(X^*)}$ and $\text{Cov}(u, X^*) = \text{Cov}(u, X) + \text{Var}(u)$, so the bias can go in either direction depending on the sign and magnitude of $\text{Cov}(u, X)$. This non-classical case often prevails in empirical settings with self-reported data, leading to unpredictable inconsistencies that undermine causal inference.

A representative example occurs in growth regressions, where measurement errors in explanatory variables like initial GDP or education levels bias estimates of their effects on economic growth. Such biases mask the true impact of human capital on growth, prompting corrections via instrumental variables or multiple measurements.

The implications of measurement error extend to the dependent variable as well. In the classical case, where the observed outcome is $Y^* = Y + v$ with $v$ uncorrelated with the regressors, OLS coefficients remain consistent, though standard errors inflate due to increased residual variance. However, non-classical errors in $Y$, such as systematic underreporting, can make the measurement error correlated with the regressors, inducing endogeneity and potentially amplifying bias away from zero in the presence of other violations like omitted variables. Overall, while classical errors in explanatory variables typically attenuate effects, non-classical forms demand careful modeling to avoid severe inconsistencies.
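
The classical attenuation result can be reproduced in a few lines. This sketch assumes normally distributed true values and noise with hypothetical variances Var(X) = 1.0 and Var(u) = 0.5, and compares the OLS slope on the mismeasured regressor with β·Var(X)/(Var(X) + Var(u)).

```python
import numpy as np

def attenuation_demo(n=200_000, beta=1.5, var_x=1.0, var_u=0.5, seed=3):
    """Regress Y on a noisy regressor and compare with the attenuation formula."""
    rng = np.random.default_rng(seed)
    x_true = rng.normal(scale=np.sqrt(var_x), size=n)   # unobserved true regressor
    u = rng.normal(scale=np.sqrt(var_u), size=n)        # classical measurement error
    x_obs = x_true + u                                  # what the researcher observes
    y = beta * x_true + rng.normal(size=n)
    b_hat = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)
    b_plim = beta * var_x / (var_x + var_u)             # attenuated probability limit
    return b_hat, b_plim

if __name__ == "__main__":
    b_hat, b_plim = attenuation_demo()
    print(f"OLS on noisy X: {b_hat:.3f}, attenuation formula: {b_plim:.3f}, true beta: 1.5")
```

Raising `var_u` relative to `var_x` lowers the signal-to-noise ratio and pulls the estimate further toward zero.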

Simultaneity

Simultaneity in static econometric models occurs when multiple endogenous variables are jointly determined through mutual causal relationships, resulting in a correlation between the explanatory variables and the disturbance terms in the structural equations. This correlation violates the exogeneity assumption required for consistent estimation using OLS, leading to biased and inconsistent parameter estimates. In such systems, the contemporaneous interdependence means that none of the jointly determined variables can be treated as exogenous, as each influences the others within the same time period.

A prototypical mechanism is found in supply and demand systems, where price and quantity are simultaneously determined. Consider the structural demand equation $Q_d = \alpha + \beta P + \gamma Z + u$ and supply equation $Q_s = \delta + \theta P + \phi W + v$, where $Q_d = Q_s = Q$ at equilibrium, $Z$ represents demand shifters (e.g., consumer income), $W$ denotes supply shifters (e.g., production costs), and $u, v$ are structural errors that may be correlated, such as $\text{Cov}(u, v) \neq 0$. Solving for the reduced form yields $Q = \pi_0 + \pi_1 X + \nu$ and $P = \pi_0' + \pi_1' X + \nu'$, where $X$ includes the exogenous shifters and $\nu, \nu'$ aggregate the structural errors. Because the equilibrium price depends on both structural disturbances through its reduced-form error, estimating either structural equation directly by OLS introduces endogeneity: $P$ is correlated with $u$ in the demand equation and with $v$ in the supply equation. This simultaneity bias prevents direct recovery of structural parameters like $\beta$ or $\theta$ without additional identifying restrictions.

In a general linear simultaneous equations system with two endogenous variables, the model takes the form $Y_1 = \alpha + \beta Y_2 + \mu_1' X_1 + u_1$ and $Y_2 = \gamma + \delta Y_1 + \mu_2' X_2 + u_2$, where $X_1$ and $X_2$ are vectors of exogenous variables excluded from the respective opposite equations, and $\text{Cov}(u_1, u_2) \neq 0$. The mutual dependence implies that $Y_2$ is endogenous in the first equation, because it depends on $Y_1$ and therefore on $u_1$ (in addition to the shared error covariance), and similarly $Y_1$ is endogenous in the second. Identification requires satisfying order and rank conditions on the exclusion restrictions, such as at least one exogenous variable unique to each equation to trace structural effects. Seminal analysis by Koopmans established these criteria, highlighting how failure to impose such restrictions renders the system underidentified and exacerbates the bias from simultaneity.

A representative application appears in labor market models, where wages and hours worked (or employment levels) are endogenously determined by intersecting labor supply and demand curves. Labor supply increases with wages due to income and substitution effects, while demand decreases with wages given productivity constraints, creating bidirectional causality; estimating either curve via OLS thus yields inconsistent estimates unless instruments break the simultaneity. This static setup assumes no intertemporal lags, focusing on contemporaneous equilibrium rather than dynamic adjustments over time.

Causes in Dynamic Models

Lagged Dependent Variables

In dynamic econometric models, the inclusion of lagged dependent variables as regressors introduces endogeneity when the lagged term correlates with the current error term. Consider the autoregressive model of order one (AR(1)):

$$y_t = \rho y_{t-1} + \beta x_t + \varepsilon_t,$$

where $y_t$ is the dependent variable at time $t$, $\rho$ is the autoregressive parameter, $x_t$ are exogenous regressors, and $\varepsilon_t$ is the error term. If the errors exhibit serial correlation, such as $\varepsilon_t = \gamma \varepsilon_{t-1} + u_t$ with $|\gamma| > 0$ and $u_t$ iid, then $\text{Cov}(y_{t-1}, \varepsilon_t) \neq 0$, because $y_{t-1}$ embeds past errors that influence $\varepsilon_t$. Specifically, lagging the model one period gives $E[y_{t-1} \varepsilon_t] = \rho E[y_{t-2} \varepsilon_t] + E[\varepsilon_{t-1} \varepsilon_t]$, and the last term equals $\gamma \text{Var}(\varepsilon_{t-1}) \neq 0$; under stationarity this yields $E[y_{t-1} \varepsilon_t] = \gamma \sigma_\varepsilon^2 / (1 - \rho \gamma)$, with $\sigma_\varepsilon^2 = \text{Var}(\varepsilon_t)$. The lagged dependent variable is therefore endogenous, and ordinary least squares (OLS) estimates are biased.

This endogeneity arises from temporal dependence in the data-generating process, distinguishing it from static models where simultaneity involves contemporaneous mutual causation without lagged effects. In panel data settings with fixed effects, the problem persists even without serial correlation in the idiosyncratic errors, due to the correlation between the lagged dependent variable and the transformed errors after demeaning. For an AR(1) panel model with individual fixed effects and time dimension $T$, the within-group estimator of $\rho$ suffers from a finite-$T$ bias, known as the Nickell bias. Nickell (1981) derives the asymptotic bias of this estimator as the number of units grows with $T$ fixed; for reasonably large $T$ it is approximately $-\frac{1 + \rho}{T - 1}$, which behaves like $-\frac{1 + \rho}{T}$ as $T$ grows, so for $|\rho| < 1$ the estimated persistence parameter is biased downward.

A representative example occurs in empirical economic growth models, where lagged GDP is included to capture convergence dynamics but introduces endogeneity due to persistent shocks affecting both current and past output. In cross-country panel regressions of the form $\Delta \ln y_{it} = \alpha \ln y_{i,t-1} + \beta' z_{it} + \varepsilon_{it}$, where $y_{it}$ is GDP for country $i$ at time $t$ and $z_{it}$ are controls like investment rates, the lagged term correlates with the error through unobserved persistent factors such as technology shocks or policy inertia. Caselli, Esquivel, and Lefort (1996) address this using generalized method of moments (GMM) estimators in dynamic panels, finding convergence rates around 10% per year after correcting for the endogeneity of initial income levels.
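
The Nickell bias can be seen in a small Monte Carlo sketch. The design below is an assumption made for illustration: a pure AR(1) panel with unit fixed effects, no additional regressors, iid standard normal shocks, and a short time dimension; the function name and defaults are hypothetical.

```python
import numpy as np

def within_ar1_estimate(n_units=5000, t_periods=6, rho=0.5, burn_in=50, seed=4):
    """Within-group (fixed effects) estimate of rho in a short AR(1) panel."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=n_units)           # unit-specific fixed effects
    y = alpha / (1.0 - rho)                    # start each unit at its long-run mean
    for _ in range(burn_in):                   # burn-in to approach stationarity
        y = rho * y + alpha + rng.normal(size=n_units)
    panel = np.empty((n_units, t_periods + 1))
    panel[:, 0] = y
    for t in range(1, t_periods + 1):
        panel[:, t] = rho * panel[:, t - 1] + alpha + rng.normal(size=n_units)
    y_lag, y_cur = panel[:, :-1], panel[:, 1:]
    # Demeaning within each unit removes alpha but ties the regressor to the averaged errors
    y_lag_dm = y_lag - y_lag.mean(axis=1, keepdims=True)
    y_cur_dm = y_cur - y_cur.mean(axis=1, keepdims=True)
    return (y_lag_dm * y_cur_dm).sum() / (y_lag_dm ** 2).sum()

if __name__ == "__main__":
    rho_hat = within_ar1_estimate()
    print(f"within estimate of rho: {rho_hat:.3f} (true value 0.5)")
```

With only six usable periods the estimate falls well short of the true persistence parameter; increasing `t_periods` shrinks the gap, consistent with a bias of order 1/T.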

Dynamic Simultaneity

Dynamic simultaneity refers to a form of endogeneity in dynamic econometric models where multiple endogenous variables are jointly determined at each time period, with their current values influenced by lags and correlated error terms across equations. Consider a simple two-equation dynamic system:

$$Y_{1t} = \alpha + \beta Y_{2t} + \gamma Y_{1,t-1} + u_{1t}$$

$$Y_{2t} = \delta + \theta Y_{1t} + \phi Y_{2,t-1} + u_{2t}$$

where the error terms $u_{1t}$ and $u_{2t}$ are contemporaneously correlated, such as $\text{Cov}(u_{1t}, u_{2t}) \neq 0$. This correlation implies that $Y_{2t}$ is endogenous in the first equation because it shares common shocks with $u_{1t}$, rendering ordinary least squares estimates biased and inconsistent.

The mechanism driving endogeneity in these systems stems from intertemporal feedback: the current value of an explanatory variable $X_t$ (or another endogenous variable) affects the dependent variable $Y_t$, but $Y_t$ in turn influences future values of $X$, propagating correlations backward through lags to create $\text{Cov}(X_t, \epsilon_t) \neq 0$. This dynamic feedback violates strict exogeneity, as past realizations of $Y$ shape current regressors, and it often arises in processes with adjustment costs.

A representative example appears in dynamic macroeconomic models, such as intertemporal extensions of the IS-LM framework, where investment and output are simultaneously determined over time. Here, current investment boosts output, while lagged output signals influence investment decisions through mechanisms like the accelerator principle, incorporating adjustment lags that entwine the variables intertemporally and generate endogenous correlations.

In contrast to static simultaneity, which involves only contemporaneous mutual causation, dynamic simultaneity integrates temporal elements like lagged adjustments or forward-looking expectations, allowing shocks to persist and amplify endogeneity across periods. This structure captures real-world economic dynamics but complicates identification, as the lagged terms entangle current and past influences.

Consequences

Bias and Inconsistency

Endogeneity in econometric models leads to systematic deviations in parameter estimates from their true values, even as the sample size grows large. In ordinary least squares (OLS) estimation, this manifests as bias and inconsistency when an explanatory variable $X$ is correlated with the error term $\epsilon$, violating the exogeneity assumption. Bias refers to the expected value of the estimator differing from the true parameter, while inconsistency means the estimator does not converge in probability to the true parameter as the sample size $n \to \infty$. For instance, omitted variables that influence both $X$ and the outcome can induce such correlation, resulting in persistently erroneous estimates.

The asymptotic form of the OLS estimator under endogeneity is given by

$$\text{plim}_{n \to \infty} \hat{\beta}_{OLS} = \beta + \left( \text{plim}_{n \to \infty} \frac{X'X}{n} \right)^{-1} \left( \text{plim}_{n \to \infty} \frac{X'\epsilon}{n} \right),$$

where $\beta$ is the true parameter vector, and the second term represents the bias, which equals $(E[X'X]/n)^{-1} (E[X'\epsilon]/n)$ under standard assumptions. This bias term arises because $E[X'\epsilon/n] \neq 0$ due to $\text{Cov}(X, \epsilon) \neq 0$, ensuring the probability limit does not equal $\beta$. In the scalar case, it simplifies to $\text{plim} \hat{\beta}_{OLS} = \beta + \frac{\text{Cov}(X, \epsilon)}{\text{Var}(X)}$, highlighting how the covariance drives the deviation.

The direction of the bias depends on the sign of $\text{Cov}(X, \epsilon)$: a positive covariance leads to upward bias (overestimation of $\beta$), while a negative covariance causes downward bias (underestimation). In small samples, the bias can be substantial and variable, and inconsistency implies that it persists and dominates in large samples, preventing reliable inference on causal effects. For example, in labor economics, endogeneity from unobserved ability in education regressions often produces upward bias in estimated returns to schooling, as higher ability correlates positively with both education and wages.

In dynamic panel models, endogeneity is exacerbated by the incidental parameters problem, where fixed effects estimators suffer from persistent bias due to estimating individual-specific parameters. Including lagged dependent variables as regressors introduces correlation with the error term, leading to the Nickell bias, which is of order $O(1/T)$ (where $T$ is the number of time periods) and worsens with short panels. This dynamic endogeneity amplifies inconsistency in fixed effects OLS, particularly when unobserved heterogeneity interacts with time-varying shocks. A representative example is the evaluation of policy interventions, such as minimum wage increases, where endogeneity from simultaneous price and quantity adjustments can cause OLS to overestimate effects if unobserved firm responses correlate positively with wages and outcomes.

Inference Problems

Endogeneity in econometric models violates the exogeneity assumption underlying ordinary least squares (OLS) estimation, rendering the standard errors of coefficient estimates invalid even when the bias in point estimates is minimal. The variance-covariance matrix formula for OLS assumes that explanatory variables are uncorrelated with the error term, but endogeneity introduces such correlation, leading to understated measures of uncertainty and overly narrow confidence intervals. This miscalculation of variability can produce misleading precision in estimates, as the true error structure deviates from the assumed homoskedastic and uncorrelated one.

As a result, t-tests and F-tests become unreliable under endogeneity, distorting the probabilities of Type I and Type II errors and often yielding spurious significance. For instance, an endogenous regressor may inflate test statistics, causing researchers to incorrectly reject null hypotheses of no effect, while the actual uncertainty remains higher than reported. In policy evaluation contexts, such as assessing the impact of an endogenous treatment like job training programs on firm productivity, this overconfidence can lead to erroneous conclusions about program effectiveness, potentially justifying misguided interventions based on falsely precise estimates.

In dynamic models, endogeneity exacerbates inference problems through induced serial correlation in the error terms, particularly in time series data where lagged variables correlate with current shocks. This serial correlation violates the strict exogeneity required for consistent estimation, amplifying the understatement of standard errors and further invalidating tests by propagating errors across periods. For example, in models of market entry with persistent unobservables, ignoring this dynamic endogeneity can bias counterfactual analyses, such as predicted entry responses, by underestimating the variability in outcomes. These inference issues compound any bias from endogeneity, as unreliable standard errors and tests undermine the overall reliability of conclusions drawn from the model.

Detection

Hausman Specification Test

The Hausman specification test, introduced by Jerry A. Hausman in 1978, is a statistical procedure used to detect endogeneity in econometric models by comparing estimates from ordinary least squares (OLS) and instrumental variables (IV) regression. Under the null hypothesis of no endogeneity (i.e., regressors are exogenous), both estimators are consistent, but OLS is efficient; under the alternative, OLS is inconsistent while IV remains consistent if instruments are valid. The test exploits the asymptotic difference between the two estimators to assess model misspecification arising from endogeneity.

The test statistic is computed as

$$H = (\hat{\beta}_{IV} - \hat{\beta}_{OLS})' [\text{Var}(\hat{\beta}_{IV}) - \text{Var}(\hat{\beta}_{OLS})]^{-1} (\hat{\beta}_{IV} - \hat{\beta}_{OLS}),$$

which follows a $\chi^2$ distribution with degrees of freedom equal to the number of potentially endogenous regressors under the null hypothesis. The variance difference is taken as IV minus OLS because OLS is the efficient estimator under the null, so this ordering gives a non-negative statistic. If the instruments are valid (uncorrelated with the error term) and relevant (sufficiently correlated with the endogenous regressors), rejection of the null at conventional significance levels indicates endogeneity, suggesting the need for IV estimation.

The intuition behind the test relies on the efficiency of OLS when exogeneity holds: the parameter estimates from OLS and IV should not differ systematically, as any difference would reflect inconsistency in OLS due to endogeneity. Key assumptions include the consistency of the IV estimator (requiring valid and relevant instruments) and that the variance difference $\text{Var}(\hat{\beta}_{IV}) - \text{Var}(\hat{\beta}_{OLS})$ is positive semi-definite to ensure the test's validity. The test has been extended to panel data settings, where it compares fixed-effects and random-effects estimators to detect correlation between individual effects and regressors, under analogous validity conditions.

A classic application appears in wage models testing the endogeneity of education, where quarter of birth serves as an instrument due to its influence on schooling via compulsory schooling laws. In this setup, OLS estimates the return to education at around 7%, while IV estimates are higher, closer to 10-13%, suggesting that education is endogenous. This example illustrates how the test can flag endogeneity from omitted variables like innate ability, guiding the choice of robust estimation methods.
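
For a single regressor the Hausman statistic reduces to a scalar ratio, which the following sketch computes on simulated data. The data-generating process, the use of one common residual variance (from OLS) in both variance formulas, and the no-intercept setup are simplifying assumptions for illustration; the variance difference is taken as IV minus OLS so that the statistic is non-negative.

```python
import numpy as np

def hausman_scalar(n=50_000, beta=1.0, endog=0.5, seed=5):
    """Hausman statistic comparing OLS and IV for one regressor (illustrative only)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                       # instrument
    e = rng.normal(size=n)                       # structural error
    x = z + endog * e + rng.normal(size=n)       # endog != 0 makes x endogenous
    y = beta * x + e
    # No intercept: every variable has mean zero by construction
    b_ols = (x @ y) / (x @ x)
    b_iv = (z @ y) / (z @ x)
    s2 = np.mean((y - b_ols * x) ** 2)           # error variance estimate under the null
    var_ols = s2 / (x @ x)
    var_iv = s2 * (z @ z) / (z @ x) ** 2         # never below var_ols (Cauchy-Schwarz)
    return (b_iv - b_ols) ** 2 / (var_iv - var_ols)

if __name__ == "__main__":
    print("H with endogenous x:", hausman_scalar(endog=0.5))
    print("H with exogenous x: ", hausman_scalar(endog=0.0))
```

Under the null the statistic is compared with a chi-squared distribution with one degree of freedom, whose 5% critical value is about 3.84.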

Durbin-Wu-Hausman Test

The Durbin-Wu-Hausman test, also known as the augmented regression test, provides a practical implementation of the Hausman specification test for detecting endogeneity in econometric models by incorporating residuals from auxiliary regressions. Developed through contributions by Durbin, Wu, and Hausman, it assesses whether the ordinary least squares (OLS) estimator is consistent by examining the correlation between suspected endogenous regressors and the error term.
The procedure involves two main steps. First, for a suspected endogenous regressor $X$, perform an auxiliary regression of $X$ on the set of instruments $Z$ and all other exogenous variables included in the structural model:
$$X = Z\Pi + W\gamma + e,$$
where $W$ denotes the exogenous covariates, and save the residuals $\hat{e}$, which capture any unobserved correlation between $X$ and the structural error. Second, augment the original structural regression of the dependent variable $Y$ by including these residuals:
$$Y = X\beta + W\delta + \hat{e}\theta + u.$$
The null hypothesis of exogeneity ($H_0: \theta = 0$) is tested using a t-test on $\theta$, whose statistic is asymptotically standard normal under the null, or an F-statistic when there are multiple endogenous regressors. Rejection indicates endogeneity, as the residuals proxy for the component of $X$ correlated with the error term.
Under homoskedasticity and standard assumptions, the Durbin-Wu-Hausman test is numerically equivalent to the original Hausman test based on the difference between the OLS and instrumental variables (IV) estimators, but it is computationally simpler because it avoids direct manipulation of variance-covariance matrices. Its advantages include straightforward implementation in regression software, the ability to accommodate multiple endogenous regressors by including the corresponding residual terms in the augmented model, and extensions to robust versions that account for heteroskedasticity through adjusted standard errors.
A representative application occurs in testing for simultaneity in a demand equation, where price $P$ is potentially endogenous because it is jointly determined with quantity. Supply-side shifters, such as input costs that are uncorrelated with demand shocks, serve as instruments $Z$. The test proceeds by regressing $P$ on $Z$ and the exogenous factors to obtain residuals $\hat{e}$, then augmenting the quantity regression with $\hat{e}$ and testing its significance; rejection confirms endogeneity from simultaneous supply-demand interactions.
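The two-step procedure can be sketched as follows, assuming one endogenous regressor, one instrument, no additional exogenous covariates $W$, and no intercept, so that the augmented regression reduces to a two-regressor least-squares problem solved by hand. The simulated data and the names `simulate`, `dot` and `dwh_test` are hypothetical.

```python
import math
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def simulate(n=5000, beta=1.0, seed=0):
    """Hypothetical data in which x is correlated with the structural error."""
    rng = random.Random(seed)
    y, x, z = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)
        v = rng.gauss(0, 1)
        eps = 0.8 * v + 0.6 * rng.gauss(0, 1)
        xi = zi + v
        x.append(xi)
        z.append(zi)
        y.append(beta * xi + eps)
    return y, x, z

def dwh_test(y, x, z):
    """Augmented-regression test: returns (beta_hat, theta_hat, t statistic)."""
    n = len(y)
    # Step 1: regress x on z and keep the residuals.
    pi_hat = dot(z, x) / dot(z, z)
    e_hat = [xi - pi_hat * zi for xi, zi in zip(x, z)]
    # Step 2: regress y on x and e_hat via the 2x2 normal equations.
    sxx, see, sxe = dot(x, x), dot(e_hat, e_hat), dot(x, e_hat)
    sxy, sey = dot(x, y), dot(e_hat, y)
    det = sxx * see - sxe ** 2
    beta_hat = (see * sxy - sxe * sey) / det
    theta_hat = (sxx * sey - sxe * sxy) / det
    resid = [yi - beta_hat * xi - theta_hat * ei
             for yi, xi, ei in zip(y, x, e_hat)]
    s2 = dot(resid, resid) / (n - 2)
    se_theta = math.sqrt(s2 * sxx / det)   # homoskedastic standard error
    return beta_hat, theta_hat, theta_hat / se_theta

y, x, z = simulate()
beta_hat, theta_hat, t_stat = dwh_test(y, x, z)
print(f"beta={beta_hat:.3f}  theta={theta_hat:.3f}  t={t_stat:.1f}")
```

A useful by-product of this construction is that the coefficient on $X$ in the augmented regression coincides with the IV estimate, which is one way to see why the two tests are linked.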

Solutions

Instrumental Variables

Instrumental variables (IV) estimation provides a fundamental approach to correcting for endogeneity in econometric models by leveraging exogenous variables, termed instruments, that influence the endogenous regressors but are uncorrelated with the model's error term. This method allows for consistent estimation when ordinary least squares (OLS) fails due to correlation between the explanatory variables and the error. The core idea is to use the variation in the endogenous variable explained by the instruments to identify the causal effect on the outcome.
In the structural equation $Y = X\beta + \epsilon$, where $X$ is endogenous, the IV estimator is given by
$$\hat{\beta}_{IV} = (Z'X)^{-1} Z'Y,$$
with $Z$ denoting the matrix of instruments. This formula applies to the just-identified case, in which $Z$ has as many columns as $X$: any exogenous regressors in $X$ serve as their own instruments, and there is exactly one excluded instrument per endogenous regressor. Equivalently, IV can be computed using two-stage least squares (2SLS), which also covers the over-identified case: first, regress $X$ on $Z$ (and any exogenous covariates) to obtain fitted values $\hat{X}$; second, regress $Y$ on $\hat{X}$ (and the exogenous covariates) to recover $\hat{\beta}$. The 2SLS procedure projects the endogenous variables onto the space spanned by the instruments, isolating exogenous variation for estimation.
Valid instruments must meet three key conditions. Relevance requires that the instruments correlate with the endogenous regressors, formally $\text{Cov}(Z, X) \neq 0$, so that they carry enough identifying variation. Exogeneity demands that the instruments are uncorrelated with the error term, $\text{Cov}(Z, \epsilon) = 0$, so they do not pick up omitted factors or reverse causality. The exclusion restriction stipulates that the instruments affect the outcome $Y$ solely through their impact on $X$, ruling out direct channels that could bias results. If these hold, the IV estimator is consistent, converging to the true $\beta$ as the sample size grows, though its asymptotic variance exceeds that of OLS because it uses only the part of the variation in $X$ that the instruments explain.
A prominent application involves estimating returns to education, where schooling attainment ($X$) is endogenous because unobserved ability correlates with both schooling and earnings ($Y$). One well-known study exploited geographic variation in college proximity as an instrument, such as the presence of a 4-year college in the county of residence during youth. The argument is that this instrument satisfies the conditions: proximity correlates with enrollment (relevance), is exogenous to individual ability (exogeneity), and influences earnings only via schooling (exclusion). The IV estimates yielded returns of 10-15% per year of schooling, substantially higher than OLS, which indicates that the naive regressions were biased, in this case downward rather than upward.
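The two-stage procedure can be sketched on simulated data with a binary instrument and an intercept in both stages. Everything here is hypothetical: the data-generating process (an unobserved confounder `a` that raises both `x` and `y`), the parameter values, and the helper names `ols_simple`, `two_sls` and `iv_ratio`. The sketch returns point estimates only; standard errors taken from a naive second-stage OLS would not be valid.

```python
import random

def mean(a):
    return sum(a) / len(a)

def ols_simple(y, x):
    """Intercept and slope from regressing y on a constant and x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def simulate(n=20000, beta=0.10, seed=1):
    """Hypothetical data: confounder a drives both x and y; z shifts x only."""
    rng = random.Random(seed)
    y, x, z = [], [], []
    for _ in range(n):
        a = rng.gauss(0, 1)                       # unobserved confounder
        zi = 1.0 if rng.random() < 0.5 else 0.0   # binary instrument
        xi = 12.0 + zi + a + rng.gauss(0, 1)
        yi = 1.0 + beta * xi + 0.3 * a + rng.gauss(0, 0.5)
        y.append(yi)
        x.append(xi)
        z.append(zi)
    return y, x, z

def two_sls(y, x, z):
    """Slope on x from two-stage least squares with one instrument."""
    a1, b1 = ols_simple(x, z)              # first stage: x on z
    x_hat = [a1 + b1 * zi for zi in z]     # fitted values
    _, slope = ols_simple(y, x_hat)        # second stage: y on x_hat
    return slope

def iv_ratio(y, x, z):
    """Just-identified IV slope written as Cov(z, y) / Cov(z, x)."""
    mz, mx, my = mean(z), mean(x), mean(y)
    czy = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
    czx = sum((zi - mz) * (xi - mx) for zi, xi in zip(z, x))
    return czy / czx

y, x, z = simulate()
print(f"OLS={ols_simple(y, x)[1]:.3f}  2SLS={two_sls(y, x, z):.3f}  "
      f"ratio={iv_ratio(y, x, z):.3f}")
```

Because the simulated confounder raises both variables, OLS overstates the slope here while the two IV computations agree with each other; that direction of bias is a property of this simulation, not of any particular empirical study.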

Fixed Effects and Other Methods

In panel data settings, fixed effects estimation addresses endogeneity arising from unobserved time-invariant individual-specific factors by exploiting within-unit variation over time. Consider the linear panel model $Y_{it} = \alpha_i + \beta X_{it} + \epsilon_{it}$, where $i$ indexes units (e.g., individuals), $t$ indexes time, $\alpha_i$ captures unobserved time-invariant heterogeneity correlated with $X_{it}$, and $\epsilon_{it}$ is the idiosyncratic error. To eliminate $\alpha_i$, the fixed effects estimator applies the within transformation, or demeaning:
$$(Y_{it} - \bar{Y}_i) = \beta (X_{it} - \bar{X}_i) + (\epsilon_{it} - \bar{\epsilon}_i),$$
where $\bar{Y}_i$ and $\bar{X}_i$ are unit-specific time means. This transformation removes $\alpha_i$, yielding consistent estimates of $\beta$ under strict exogeneity of the regressors (each $X_{it}$ is uncorrelated with $\epsilon_{is}$ in every period $s$), thereby mitigating bias from time-invariant confounders.
An alternative to demeaning is first-differencing, which also purges the fixed effects by focusing on temporal changes:
$$\Delta Y_{it} = \beta \Delta X_{it} + \Delta \epsilon_{it},$$
where $\Delta$ denotes the first difference (e.g., $\Delta Y_{it} = Y_{it} - Y_{i,t-1}$). This approach requires that $\Delta X_{it}$ be uncorrelated with $\Delta \epsilon_{it}$, which strict exogeneity guarantees. With two periods the two estimators coincide; with more, fixed effects is more efficient when $\epsilon_{it}$ is serially uncorrelated, while first-differencing is more efficient when $\epsilon_{it}$ follows a random walk, so that $\Delta \epsilon_{it}$ is serially uncorrelated. However, first-differencing can amplify the attenuation caused by measurement error and loses observations when a unit's time series has gaps, making it less flexible than demeaning in unbalanced panels. Both methods leverage the panel structure to control for endogeneity without external instruments, though they rely on sufficient time-series variation within units.
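The within and first-difference transformations can be compared with pooled OLS in a small simulation, assuming a balanced panel in which the unit effect $\alpha_i$ also enters the regressor. The design, parameter values, and function names (`simulate_panel`, `pooled_ols`, `within`, `first_diff`) are hypothetical.

```python
import random

def mean(a):
    return sum(a) / len(a)

def simulate_panel(n_units=500, n_periods=5, beta=2.0, seed=2):
    """Hypothetical balanced panel; alpha_i shifts both x and y."""
    rng = random.Random(seed)
    panel = []  # one (xs, ys) pair of time series per unit
    for _ in range(n_units):
        alpha = rng.gauss(0, 1)
        xs = [alpha + rng.gauss(0, 1) for _ in range(n_periods)]
        ys = [alpha + beta * x + rng.gauss(0, 1) for x in xs]
        panel.append((xs, ys))
    return panel

def pooled_ols(panel):
    """Slope from stacking all observations and ignoring alpha_i."""
    x = [v for xs, _ in panel for v in xs]
    y = [v for _, ys in panel for v in ys]
    mx, my = mean(x), mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return num / sum((xi - mx) ** 2 for xi in x)

def within(panel):
    """Fixed effects slope: demean x and y within each unit."""
    num = den = 0.0
    for xs, ys in panel:
        mx, my = mean(xs), mean(ys)
        num += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        den += sum((x - mx) ** 2 for x in xs)
    return num / den

def first_diff(panel):
    """First-difference slope: regress changes in y on changes in x."""
    num = den = 0.0
    for xs, ys in panel:
        for t in range(1, len(xs)):
            dx, dy = xs[t] - xs[t - 1], ys[t] - ys[t - 1]
            num += dx * dy
            den += dx * dx
    return num / den

panel = simulate_panel()
print(f"pooled={pooled_ols(panel):.3f}  within={within(panel):.3f}  "
      f"first-diff={first_diff(panel):.3f}")
```

In this design the pooled slope absorbs part of the unit effect, while both transformed estimators should land near the value of `beta` used in the simulation.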
Other panel methods include random effects estimation, which assumes the individual effects $\alpha_i$ are uncorrelated with the regressors ($\text{Cov}(\alpha_i, X_{it}) = 0$) and treats them as random draws from a distribution, allowing for more efficient use of between-unit variation via generalized least squares. This approach, developed in early work on error components models, yields consistent estimates under its exogeneity assumption but is inconsistent if such correlation exists, as tested by the Hausman specification test. For dynamic panels with lagged dependent variables, where endogeneity persists due to feedback, generalized method of moments (GMM) estimators extend fixed effects by instrumenting the endogenous lags with internal lagged levels or differences, addressing issues like the Nickell bias, a downward bias in fixed effects estimates of the autoregressive coefficient in finite samples.
Despite these advantages, fixed effects and related methods cannot address time-varying endogeneity, such as that arising from unobserved shocks or policy changes that vary over time and are correlated with the regressors, and they may exacerbate bias in dynamic models via the Nickell effect, which is pronounced in short time dimensions (small $T$). For instance, in panel studies of wage determination, fixed effects control for individual heterogeneity (e.g., innate ability) correlated with education, so that returns to schooling are identified from within-person variation, but they fail if time-varying factors like business cycles induce contemporaneous endogeneity. These techniques complement instrumental variables for residual endogeneity but prioritize panel-specific variation over external exclusion restrictions.
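The Nickell bias can be seen in a small simulation of a first-order autoregressive panel with unit effects, estimated by the within transformation for a short and a long time dimension. The design (stationary start, autoregressive coefficient of 0.5, 1000 units) and the names `simulate_ar1_panel` and `within_ar1` are assumptions chosen for illustration.

```python
import random

def mean(a):
    return sum(a) / len(a)

def simulate_ar1_panel(n_units=1000, n_periods=6, rho=0.5, seed=3):
    """Hypothetical panel: y_it = alpha_i + rho * y_i,t-1 + eps_it."""
    rng = random.Random(seed)
    panel = []
    for _ in range(n_units):
        alpha = rng.gauss(0, 1)
        # Draw the initial value from the stationary distribution given alpha.
        y = alpha / (1 - rho) + rng.gauss(0, 1) / (1 - rho ** 2) ** 0.5
        ys = [y]
        for _ in range(n_periods - 1):
            y = alpha + rho * y + rng.gauss(0, 1)
            ys.append(y)
        panel.append(ys)
    return panel

def within_ar1(panel):
    """Fixed effects estimate of rho: demean y_it and its lag within units."""
    num = den = 0.0
    for ys in panel:
        lag, cur = ys[:-1], ys[1:]
        ml, mc = mean(lag), mean(cur)
        num += sum((l - ml) * (c - mc) for l, c in zip(lag, cur))
        den += sum((l - ml) ** 2 for l in lag)
    return num / den

rho_short = within_ar1(simulate_ar1_panel(n_periods=6))    # 5 usable periods
rho_long = within_ar1(simulate_ar1_panel(n_periods=51))    # 50 usable periods
print(f"true rho=0.5  short-T estimate={rho_short:.3f}  "
      f"long-T estimate={rho_long:.3f}")
```

The short-panel estimate should fall clearly below the true coefficient even with many units, while the long-panel estimate should be much closer, which is why the bias is described as a small-$T$ problem rather than a small-sample one in the usual sense.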
