Endogeneity (econometrics)
In econometrics, endogeneity broadly refers to situations in which an explanatory variable is correlated with the error term.[1]
In simplest terms, endogeneity means that a factor or cause one uses to explain something as an outcome is also being influenced by that same thing. For example, education can affect income, but income can also affect how much education someone gets. When this happens, one's analysis might wrongly estimate cause and effect. The thing one thinks is causing change is also being influenced by the outcome, making the results unreliable.
The concept originates from simultaneous equations models, in which one distinguishes variables whose values are determined within the economic model (endogenous) from those that are predetermined (exogenous).[a][2]
Ignoring simultaneity in estimation leads to biased and inconsistent estimators, as it violates the exogeneity condition of the Gauss–Markov theorem. This issue is often overlooked in non-experimental research, which limits the validity of causal inference and the ability to draw reliable policy recommendations.[3]
Common solutions to address endogeneity include the use of instrumental variable techniques, which provide consistent estimators by introducing variables that are correlated with the endogenous explanatory variable but uncorrelated with the error term.
Besides simultaneity, correlation between explanatory variables and the error term can arise when an unobserved or omitted variable is confounding both independent and dependent variables, or when independent variables are measured with error.[4]
Exogeneity versus endogeneity
In a stochastic model, the notions of ordinary exogeneity, sequential exogeneity, and strong/strict exogeneity can be defined. Exogeneity is always stated relative to a parameter: a variable or set of variables is said to be exogenous for a parameter $\alpha$. Even if a variable is exogenous for parameter $\alpha$, it might be endogenous for parameter $\beta$.
When the explanatory variables are not stochastic, they are strongly exogenous for all the parameters.
If the independent variable is correlated with the error term in a regression model then the estimate of the regression coefficient in an ordinary least squares (OLS) regression is biased; however if the correlation is not contemporaneous, then the coefficient estimate may still be consistent. There are many methods of correcting the bias, including instrumental variable regression and Heckman selection correction.
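As a hedged illustration of the instrumental variable idea, the sketch below simulates a toy model in which an unobserved confounder makes the regressor endogenous, then compares the OLS slope with the simple IV estimator Cov(w, y) / Cov(w, x). The data-generating process, the names `w`, `a`, `simulate_iv`, and all parameter values are assumptions chosen for illustration; they do not come from the cited sources.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (n - 1)

def simulate_iv(n=20000, beta=1.0, seed=0):
    """Return (OLS slope, IV slope) for a toy model with one instrument."""
    rng = random.Random(seed)
    w, x, y = [], [], []
    for _ in range(n):
        wi = rng.gauss(0, 1)              # instrument: moves x, unrelated to the error
        ai = rng.gauss(0, 1)              # unobserved confounder (ends up in the error)
        xi = wi + ai + rng.gauss(0, 1)    # endogenous regressor
        yi = beta * xi + 2.0 * ai + rng.gauss(0, 1)
        w.append(wi)
        x.append(xi)
        y.append(yi)
    ols = cov(x, y) / cov(x, x)           # inconsistent: x is correlated with the error
    iv = cov(w, y) / cov(w, x)            # consistent: w is uncorrelated with the error
    return ols, iv

ols_slope, iv_slope = simulate_iv()
print(f"OLS slope: {ols_slope:.3f}, IV slope: {iv_slope:.3f}")
```

Under these assumed parameters the OLS slope should settle near 1 + 2/3 in large samples, while the IV slope should be close to the true value of 1; the exact figures depend on the seed and sample size.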
Static models
The following are some common sources of endogeneity.
Omitted variable
In this case, the endogeneity comes from an uncontrolled confounding variable, a variable that is correlated with both the independent variable in the model and with the error term. (Equivalently, the omitted variable affects the independent variable and separately affects the dependent variable.)
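Assume that the "true" model to be estimated is
$$y_i = \alpha + \beta x_i + \gamma z_i + u_i$$
but $z_i$ is omitted from the regression model (perhaps because there is no way to measure it directly). Then the model that is actually estimated is
$$y_i = \alpha + \beta x_i + \varepsilon_i$$
where $\varepsilon_i = \gamma z_i + u_i$ (thus, the $z_i$ term has been absorbed into the error term).
If the correlation of $x$ and $z$ is not 0 and $z$ separately affects $y$ (meaning $\gamma \neq 0$), then $x$ is correlated with the error term $\varepsilon$.
Here, $x$ is not exogenous for $\alpha$ and $\beta$, since, given $x$, the distribution of $y$ depends not only on $\alpha$ and $\beta$, but also on $z$ and $\gamma$.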
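The omitted-variable mechanism can be checked with a small simulation. The sketch below assumes a true model y = 0.5 + beta*x + gamma*z + u in which x is built to be correlated with z, then regresses y on x alone. The parameter values and the function names `ols_slope` and `simulate_omitted_variable` are illustrative assumptions, not part of the article's sources.

```python
import random

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate_omitted_variable(n=20000, beta=1.0, gamma=2.0, rho=0.5, seed=0):
    """Regress y on x while leaving out z, which drives both x and y."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                  # the variable that will be omitted
        xi = rho * z + rng.gauss(0, 1)       # x is correlated with z
        u = rng.gauss(0, 1)
        x.append(xi)
        y.append(0.5 + beta * xi + gamma * z + u)
    return ols_slope(x, y)

short_slope = simulate_omitted_variable()
# Large-sample value: beta + gamma * Cov(x, z) / Var(x) = 1 + 2 * 0.5 / 1.25 = 1.8
print(f"slope with z omitted: {short_slope:.3f} (true beta = 1.0)")
```

With these assumed values the estimated slope should sit near 1.8 rather than the true 1.0, because the error term of the short regression contains gamma*z, which is correlated with x.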
Measurement error
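Suppose that a perfect measure of an independent variable is impossible. That is, instead of observing $x_i^*$, what is actually observed is $x_i = x_i^* + \nu_i$, where $\nu_i$ is the measurement error or "noise". In this case, a model given by
$$y_i = \alpha + \beta x_i^* + \varepsilon_i$$
can be written in terms of observables and error terms as
$$y_i = \alpha + \beta (x_i - \nu_i) + \varepsilon_i = \alpha + \beta x_i + (\varepsilon_i - \beta \nu_i) = \alpha + \beta x_i + u_i, \quad \text{where } u_i = \varepsilon_i - \beta \nu_i.$$
Since both $x_i$ and $u_i$ depend on $\nu_i$, they are correlated, so the OLS estimate of $\beta$ will be biased toward zero (attenuation bias).
Measurement error in the dependent variable, $y_i$, does not cause endogeneity, though it does increase the variance of the error term.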
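A minimal simulation sketch of both statements follows, assuming a standard normal true regressor, independent normal noise, and a true slope of 2. It estimates the slope once with a mismeasured regressor and once with a mismeasured dependent variable. All names and parameter values are hypothetical choices for illustration.

```python
import random

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate_measurement_error(n=20000, beta=2.0, noise_sd=1.0, seed=0):
    """Return (slope with noisy regressor, slope with noisy outcome)."""
    rng = random.Random(seed)
    x_true, x_obs, y, y_obs = [], [], [], []
    for _ in range(n):
        xs = rng.gauss(0, 1)                           # true regressor, variance 1
        yi = 1.0 + beta * xs + rng.gauss(0, 1)
        x_true.append(xs)
        x_obs.append(xs + rng.gauss(0, noise_sd))      # regressor observed with noise
        y.append(yi)
        y_obs.append(yi + rng.gauss(0, noise_sd))      # outcome observed with noise
    return ols_slope(x_obs, y), ols_slope(x_true, y_obs)

slope_noisy_x, slope_noisy_y = simulate_measurement_error()
# Noisy regressor: large-sample slope is beta * 1 / (1 + noise_sd**2) = 1.0
# Noisy outcome:   large-sample slope stays at beta = 2.0
print(f"noisy x: {slope_noisy_x:.3f}, noisy y: {slope_noisy_y:.3f}")
```

Under these assumptions the slope with a noisy regressor should be roughly half the true value, while noise in the dependent variable should leave the slope near 2 and only make the estimate less precise.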
Simultaneity
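Suppose that two variables are codetermined, with each affecting the other according to the following "structural" equations:
$$y_i = \beta_1 x_i + \gamma_1 z_i + u_i$$
$$z_i = \beta_2 x_i + \gamma_2 y_i + v_i$$
Estimating either equation by itself results in endogeneity. In the case of the first structural equation, $E(z_i u_i) \neq 0$. Solving for $z_i$ while assuming that $1 - \gamma_1 \gamma_2 \neq 0$ results in
$$z_i = \frac{\beta_2 + \gamma_2 \beta_1}{1 - \gamma_1 \gamma_2} x_i + \frac{1}{1 - \gamma_1 \gamma_2} v_i + \frac{\gamma_2}{1 - \gamma_1 \gamma_2} u_i.$$
Assuming that $x_i$ and $v_i$ are uncorrelated with $u_i$,
$$E(z_i u_i) = \frac{\gamma_2}{1 - \gamma_1 \gamma_2} E(u_i u_i) \neq 0.$$
Therefore, attempts at estimating either structural equation will be hampered by endogeneity.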
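The simultaneity result can be illustrated numerically. The sketch below assumes a stripped-down version of the system with the exogenous regressor dropped (y = gamma1*z + u and z = gamma2*y + v, with independent standard normal u and v), generates data from the reduced form, and reports both the OLS slope of y on z and the sample mean of z*u. The parameter values and the function name `simulate_simultaneity` are illustrative assumptions.

```python
import random

def simulate_simultaneity(n=20000, gamma1=0.5, gamma2=0.5, seed=0):
    """Return (OLS slope of y on z, sample mean of z*u) for the two-equation system."""
    rng = random.Random(seed)
    d = 1.0 - gamma1 * gamma2              # must be non-zero for the system to be solvable
    ys, zs, zu_sum = [], [], 0.0
    for _ in range(n):
        u = rng.gauss(0, 1)
        v = rng.gauss(0, 1)
        y = (u + gamma1 * v) / d           # reduced form for y
        z = (v + gamma2 * u) / d           # reduced form for z
        ys.append(y)
        zs.append(z)
        zu_sum += z * u
    my, mz = sum(ys) / n, sum(zs) / n
    sxy = sum((a - mz) * (b - my) for a, b in zip(zs, ys))
    sxx = sum((a - mz) ** 2 for a in zs)
    return sxy / sxx, zu_sum / n

slope_y_on_z, mean_zu = simulate_simultaneity()
# Large-sample values under these assumptions:
#   E(z u)    = gamma2 / (1 - gamma1*gamma2) = 0.5 / 0.75, about 0.667
#   OLS slope = (gamma2 + gamma1) / (1 + gamma2**2) = 0.8, not gamma1 = 0.5
print(f"OLS slope: {slope_y_on_z:.3f}, mean of z*u: {mean_zu:.3f}")
```

The non-zero sample mean of z*u is the correlation between regressor and error derived above, and the OLS slope lands well away from the structural coefficient of 0.5.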
Dynamic models
The endogeneity problem is particularly relevant in the context of time series analysis of causal processes. It is common for some factors within a causal system to be dependent for their value in period t on the values of other factors in the causal system in period t − 1. Suppose that the level of pest infestation is independent of all other factors within a given period, but is influenced by the level of rainfall and fertilizer in the preceding period. In this instance it would be correct to say that infestation is exogenous within the period, but endogenous over time.
Let the model be y = f(x, z) + u. If the variable x is sequentially exogenous for parameter $\alpha$, and y does not cause x in the Granger sense, then the variable x is strongly/strictly exogenous for the parameter $\alpha$.
Simultaneity
Generally speaking, simultaneity occurs in the dynamic model just like in the example of static simultaneity above.
Footnotes
- [a] For example, in a simple supply and demand model, when predicting the equilibrium quantity demanded, the price is endogenous because producers adjust their prices in response to demand, and consumers adjust their demand in response to price. In this case, the price variable exhibits total endogeneity once the demand and supply curves are specified. By contrast, a change in consumer tastes or preferences represents an exogenous shift in the demand curve.
References
- [1] Wooldridge, Jeffrey M. (2009). Introductory Econometrics: A Modern Approach (4th ed.). Australia: South-Western. p. 88. ISBN 978-0-324-66054-8.
- [2] Kmenta, Jan (1986). Elements of Econometrics (2nd ed.). New York: MacMillan. pp. 652–653. ISBN 0-02-365070-2.
- [3] Antonakis, John; Bendahan, Samuel; Jacquart, Philippe; Lalive, Rafael (December 2010). "On making causal claims: A review and recommendations". The Leadership Quarterly. 21 (6): 1086–1120. doi:10.1016/j.leaqua.2010.10.010. ISSN 1048-9843.
- [4] Johnston, John (1972). Econometric Methods (2nd ed.). New York: McGraw-Hill. pp. 267–291. ISBN 0-07-032679-7.
Further reading
- Greene, William H. (2012). Econometric Analysis (Sixth ed.). Upper Saddle River: Pearson. ISBN 978-0-13-513740-6.
- Kennedy, Peter (2008). A Guide to Econometrics (Sixth ed.). Malden: Blackwell. p. 139. ISBN 978-1-4051-8257-7.
- Kmenta, Jan (1986). Elements of Econometrics (Second ed.). New York: MacMillan. pp. 651–733. ISBN 0-02-365070-2.
Endogeneity (econometrics)
Fundamentals
Exogeneity
In econometrics, strict exogeneity is a stringent condition under which an explanatory variable $x$ is uncorrelated with the error term $u$ across all time periods, so that the conditional expectation of the error given the entire history of $x$ is zero. Formally, $x$ is strictly exogenous if $E[u_t \mid x_s] = 0$ for all $s$ and $t$, which can be compactly expressed as $E[u_t \mid x_1, \dots, x_T] = 0$. This assumption implies no feedback from the error term to any realization of $x$, past, present, or future, making it particularly relevant in dynamic or panel data settings where temporal dependencies are present.[4][5]
Weak exogeneity, in contrast, is a less restrictive variant that centers on the conditional mean in the contemporaneous period, requiring only $E[u_t \mid x_t] = 0$ without demanding independence from the full history or future values of $x$. This allows for possible correlations between current errors and future explanatory variables, such as through feedback mechanisms, but suffices for consistent estimation in models focused on conditional means, like many applications of OLS in time series.[4][6]
The distinctions between strict and weak exogeneity were formally introduced by Engle, Hendry, and Richard in their 1983 paper, which proposed definitions in terms of the joint distribution of observable variables to address ambiguities in prior econometric usage and facilitate testing and model reduction.[7][8]
Under the classical assumptions of the linear regression model, exogeneity (typically the strict form in cross-sectional data, or the weak form in some dynamic contexts) ensures that ordinary least squares (OLS) estimators are unbiased and consistent, as the orthogonality between regressors and errors allows the population moment conditions to hold.[4][5] Without this assumption, OLS estimates suffer from bias, underscoring exogeneity's role as the baseline for valid causal inference in econometric analysis.[6] Endogeneity occurs precisely when these exogeneity conditions are violated, resulting in correlation between explanatory variables and the error term.
Endogeneity
In econometrics, endogeneity arises when one or more explanatory variables in a regression model are correlated with the error term, violating the core assumption of exogeneity that requires such variables to be uncorrelated with unobserved factors influencing the dependent variable. This correlation, formally expressed as $\mathrm{Cov}(x, u) \neq 0$, implies that the explanatory variables do not originate solely from external sources but are influenced by the same underlying processes captured in the error term.[1][9]
Under endogeneity, ordinary least squares (OLS) estimators fail to provide consistent estimates of the true parameters. Specifically, in the simple regression $y = \beta_0 + \beta_1 x + u$, the probability limit of the OLS estimator is given by
$$\operatorname{plim} \hat{\beta}_1 = \beta_1 + \frac{\mathrm{Cov}(x, u)}{\mathrm{Var}(x)},$$
where the second term represents the bias due to the non-zero covariance between $x$ and $u$. This deviation from the true parameter persists even as the sample size grows, leading to systematically incorrect inferences about causal relationships.[10]
Endogeneity can stem from various sources, including omitted factors that jointly affect both the explanatory variables and the outcome, reverse causality where the dependent variable influences the explanatory variables, or errors in measuring the explanatory variables that propagate into the error term. A classic illustration is a wage equation model where years of education is used to explain log wages, but innate ability is omitted from the specification; since ability correlates with both education choices and wages, education becomes endogenous, biasing the estimated return to education upward.[11]
The presence of endogeneity generally results in inconsistent point estimates, undermining the reliability of regression-based conclusions across empirical economic analyses, though it does not directly impair estimator efficiency in finite samples.[10]
Causes in Static Models
Omitted Variables
Omitted variable bias arises in static linear regression models when a relevant explanatory variable is excluded from the specification, leading to correlation between the included regressors and the error term. Consider the true population model $y = \beta_0 + \beta_1 x + \beta_2 z + u$, where $z$ is an omitted variable that affects the outcome $y$, and $u$ is uncorrelated with both $x$ and $z$.[12] If $z$ is omitted, the estimated model becomes $y = \beta_0 + \beta_1 x + v$, where the composite error is $v = \beta_2 z + u$. This omission induces endogeneity if $x$ and $z$ are correlated, as $\mathrm{Cov}(x, v) = \beta_2 \mathrm{Cov}(x, z) \neq 0$.[13][14]
The direction of the resulting bias in the estimator $\hat{\beta}_1$ depends on the signs of $\beta_2$ and $\mathrm{Cov}(x, z)$. In the simple case with a single included regressor, the expected value of the ordinary least squares (OLS) estimator is
$$E[\hat{\beta}_1] = \beta_1 + \beta_2 \frac{\mathrm{Cov}(x, z)}{\mathrm{Var}(x)},$$
which deviates from the true $\beta_1$ unless $\beta_2 = 0$ or $\mathrm{Cov}(x, z) = 0$.[12][15] The bias does not shrink as the sample size increases, so the estimator is inconsistent as well as biased, and it can lead to misleading inferences about the causal effect of $x$ on $y$.[14]
A classic example occurs in estimating the returns to education on wages, where innate ability is often omitted. Ability positively affects both wages and educational attainment, so omitting it results in $\beta_2 > 0$ and $\mathrm{Cov}(x, z) > 0$, yielding an upward-biased estimate of the education coefficient, potentially overstating returns by 20-30% or more in cross-sectional data.[16][17] To recover unbiased estimates, researchers may include proxy variables that capture the omitted factor without introducing new correlations, or exploit panel data to control for time-invariant unobserved heterogeneity through fixed effects.[18][19]
Measurement Error
Measurement error in explanatory variables represents a key source of endogeneity in static econometric models, where the observed variable becomes correlated with the model's error term, violating the exogeneity assumption required for consistent ordinary least squares (OLS) estimation.[20] In the classical measurement error framework, the true unobserved explanatory variable $x^*$ relates to the observed $x$ as $x = x^* + v$, where $v$ is the measurement error satisfying $E[v \mid x^*] = 0$ and uncorrelated with the true error $\varepsilon$ in the structural equation $y = \beta x^* + \varepsilon$. Substituting yields the observed regression $y = \beta x + u$, with composite error $u = \varepsilon - \beta v$. This induces endogeneity because $\mathrm{Cov}(x, u) = -\beta \sigma_v^2 \neq 0$ (assuming $\beta \neq 0$), as the measurement error enters the error term with a sign opposite to that of $\beta$. Consequently, the probability limit of the OLS estimator is
$$\operatorname{plim} \hat{\beta} = \beta \, \frac{\sigma_{x^*}^2}{\sigma_{x^*}^2 + \sigma_v^2},$$
resulting in attenuation bias toward zero whose severity depends on the signal-to-noise ratio $\sigma_{x^*}^2 / \sigma_v^2$.[20][21]
Non-classical measurement error arises when $v$ correlates with $x^*$ or $\varepsilon$, further complicating the endogeneity and potentially reversing the bias direction. For instance, if respondents report values as optimal predictors of the true $x^*$ (e.g., due to recall bias in survey data), the error may negatively correlate with $x^*$, amplifying attenuation or even causing positive bias in $\hat{\beta}$. In general, the inconsistency then depends on the covariance between $v$ and $x^*$ as well as on the noise variance, so the bias can go in either direction depending on the sign and magnitude of that covariance. This non-classical case often prevails in empirical settings with self-reported data, leading to unpredictable inconsistencies that undermine causal inference.[22][21]
A representative example occurs in growth regressions, where measurement errors in explanatory variables like initial GDP or education levels bias estimates of their effects on economic growth. Such biases mask the true impact of human capital on growth, prompting corrections via instrumental variables or multiple measurements.[20]
The implications of measurement error extend to the dependent variable as well. In the classical case, where the measurement error in $y$ is uncorrelated with the regressors, OLS coefficients remain consistent, though standard errors inflate due to increased residual variance. However, non-classical errors in $y$, such as systematic underreporting, can correlate with the regressors, inducing endogeneity and potentially amplifying bias away from zero in the presence of other violations like omitted variables. Overall, while classical errors in explanatory variables typically attenuate effects, non-classical forms demand careful modeling to avoid severe inconsistencies.[22][23]
Simultaneity
Simultaneity in static econometric models occurs when multiple endogenous variables are jointly determined through mutual causal relationships, resulting in a correlation between the explanatory variables and the disturbance terms in the structural equations. This correlation violates the strict exogeneity assumption required for consistent estimation using ordinary least squares (OLS), leading to biased and inconsistent parameter estimates. In such systems, the contemporaneous interdependence means that none of the jointly determined variables can be treated as exogenous, as each influences the others within the same time period.[24]
A prototypical mechanism is found in supply and demand systems, where price and quantity are simultaneously determined. Consider the structural demand equation $q^d = \alpha_0 + \alpha_1 p + \alpha_2 z_d + u_d$ and supply equation $q^s = \beta_0 + \beta_1 p + \beta_2 z_s + u_s$, where $q^d = q^s = q$ at equilibrium, $z_d$ represents demand shifters (e.g., consumer income), $z_s$ denotes supply shifters (e.g., production costs), and $u_d$, $u_s$ are structural errors that are potentially correlated, such as $\mathrm{Cov}(u_d, u_s) \neq 0$. Solving for the reduced form yields $p = \pi_p' z + v_p$ and $q = \pi_q' z + v_q$, where $z$ includes the exogenous shifters and $v_p$, $v_q$ aggregate the structural errors. Because the reduced-form errors are combinations of both structural disturbances, estimating a structural equation directly on observed data introduces endogeneity, as $p$ (or $q$) correlates with the error in either equation. This simultaneity bias prevents direct recovery of structural parameters like $\alpha_1$ or $\beta_1$ without additional identifying restrictions.[24][25]
In a general linear simultaneous equations system with two endogenous variables, the model takes the form $y_1 = \gamma_1 y_2 + x_1' \beta_1 + u_1$ and $y_2 = \gamma_2 y_1 + x_2' \beta_2 + u_2$, where $x_1$ and $x_2$ are vectors of exogenous variables excluded from the respective opposite equations, and $\gamma_1 \gamma_2 \neq 1$. The mutual dependence implies that $y_2$ is endogenous in the first equation due to its correlation with $u_1$ through $\gamma_2$ and the shared error covariance, and similarly for $y_1$ in the second. Identification requires satisfying order and rank conditions on the exclusion restrictions, such as at least one exogenous variable unique to each equation to trace structural effects. Seminal analysis by Koopmans established these criteria, highlighting how failure to impose such restrictions renders the system underidentified and exacerbates the bias from simultaneity.[24]
A representative application appears in labor market models, where wages and hours worked (or employment levels) are endogenously determined by intersecting labor supply and demand curves. Labor supply increases with wages due to income and substitution effects, while demand decreases with wages given productivity constraints, creating bidirectional causality; estimating either curve via OLS thus yields inconsistent estimates unless instruments break the simultaneity. This static setup assumes no intertemporal lags, focusing on contemporaneous equilibrium rather than dynamic adjustments over time.[26][25]
Causes in Dynamic Models
Lagged Dependent Variables
In dynamic econometric models, the inclusion of lagged dependent variables as regressors introduces endogeneity when the lagged term correlates with the current error term. Consider the autoregressive model of order one (AR(1)):
$$y_t = \rho y_{t-1} + x_t' \beta + \varepsilon_t,$$
where $y_t$ is the dependent variable at time $t$, $\rho$ is the autoregressive parameter, $x_t$ are exogenous regressors, and $\varepsilon_t$ is the error term. If the errors exhibit serial correlation, such as $\varepsilon_t = \phi \varepsilon_{t-1} + \eta_t$ with $|\phi| < 1$ and $\eta_t$ iid, then $\mathrm{Cov}(y_{t-1}, \varepsilon_t) \neq 0$, because $y_{t-1}$ embeds past errors that influence $\varepsilon_t$. Specifically, $\mathrm{Cov}(y_{t-1}, \varepsilon_t) = \phi \, \mathrm{Cov}(y_{t-1}, \varepsilon_{t-1}) \neq 0$ whenever $\phi \neq 0$, rendering the lagged dependent variable endogenous and biasing ordinary least squares (OLS) estimates.[27] This endogeneity arises from temporal dependence in the data-generating process, distinguishing it from static models where simultaneity involves contemporaneous mutual causation without lagged effects.
In panel data settings with fixed effects, the problem persists even without serial correlation in the idiosyncratic errors, due to the correlation between the lagged dependent variable and the transformed errors after demeaning. For an AR(1) panel model with individual fixed effects $\alpha_i$ and time dimension $T$, the within-group estimator of $\rho$ suffers from a finite-$T$ bias, known as the Nickell bias. Nickell (1981) derives an approximate bias for this estimator of $-(1 + \rho)/(T - 1)$ when $T$ is reasonably large; the bias does not vanish as the number of units $N$ grows and shrinks only as $T$ increases, leading to downward bias in the estimated persistence parameter.[28][29]
A representative example occurs in empirical economic growth models, where lagged GDP per capita is included to capture convergence dynamics but introduces endogeneity due to persistent shocks affecting both current and past output. In cross-country panel regressions of the form $y_{it} = \rho y_{i,t-1} + x_{it}' \beta + \varepsilon_{it}$, where $y_{it}$ is GDP per capita for country $i$ at time $t$ and $x_{it}$ are controls like investment rates, the lagged term correlates with the error through unobserved persistent factors such as technology shocks or policy inertia. Caselli, Esquivel, and Lefort (1996) address this using generalized method of moments (GMM) estimators in dynamic panels, finding convergence rates around 10% per year after correcting for the endogeneity of initial income levels.[30]
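The finite-T bias of the within-group estimator discussed in this section can be reproduced with a short simulation. The sketch below assumes a panel AR(1) process with unit fixed effects, serially uncorrelated standard normal shocks, and a true persistence of 0.5, and applies the within (demeaning) estimator to a short and a longer panel. The function name `within_ar1_estimate`, the burn-in length, and the panel dimensions are illustrative assumptions, and the sketch does not implement the GMM correction used in the growth literature.

```python
import random

def within_ar1_estimate(n_units=2000, n_periods=5, rho=0.5, burn_in=50, seed=0):
    """Within-group (fixed effects) estimate of rho in y_it = a_i + rho*y_i,t-1 + e_it."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n_units):
        alpha = rng.gauss(0, 1)                # unit fixed effect
        y = alpha / (1 - rho)                  # start at the unit's long-run mean
        path = []
        for t in range(burn_in + n_periods + 1):
            y = alpha + rho * y + rng.gauss(0, 1)
            if t >= burn_in:                   # keep n_periods + 1 observations
                path.append(y)
        lagged, current = path[:-1], path[1:]
        m_lag = sum(lagged) / n_periods        # unit-specific means used for demeaning
        m_cur = sum(current) / n_periods
        for a, b in zip(lagged, current):
            num += (a - m_lag) * (b - m_cur)
            den += (a - m_lag) ** 2
    return num / den

rho_short = within_ar1_estimate(n_units=2000, n_periods=5)
rho_long = within_ar1_estimate(n_units=300, n_periods=30)
print(f"T=5: {rho_short:.3f}, T=30: {rho_long:.3f} (true rho = 0.5)")
```

Under these assumptions the short-panel estimate should fall well below 0.5 even with many units, while the longer panel should come much closer to the true value, consistent with a bias that depends on T rather than on N.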
