Homoscedasticity and heteroscedasticity
from Wikipedia
Plot with random data showing homoscedasticity: at each value of x, the y-value of the dots has about the same variance.
Plot with random data showing heteroscedasticity: The variance of the y-values of the dots increases with increasing values of x.

In statistics, a sequence of random variables is homoscedastic (/ˌhoʊmoʊskəˈdæstɪk/) if all its random variables have the same finite variance; this is also known as homogeneity of variance. The complementary notion is called heteroscedasticity, also known as heterogeneity of variance. The spellings homoskedasticity and heteroskedasticity are also frequently used. “Skedasticity” comes from the Ancient Greek word “skedánnymi”, meaning “to scatter”.[1][2][3] Assuming a variable is homoscedastic when in reality it is heteroscedastic (/ˌhɛtəroʊskəˈdæstɪk/) results in unbiased but inefficient point estimates and in biased estimates of standard errors, and may result in overestimating the goodness of fit as measured by the Pearson coefficient.

The existence of heteroscedasticity is a major concern in regression analysis and the analysis of variance, as it invalidates statistical tests of significance that assume that the modelling errors all have the same variance. While the ordinary least squares (OLS) estimator is still unbiased in the presence of heteroscedasticity, it is inefficient, and inference based on the assumption of homoskedasticity is misleading. In the past, generalized least squares (GLS) was frequently used in that case.[4][5] Nowadays, standard practice in econometrics is to use heteroskedasticity-consistent standard errors instead of GLS, as GLS can exhibit strong bias in small samples if the actual skedastic function is unknown.[6]

Because heteroscedasticity concerns expectations of the second moment of the errors, its presence is referred to as misspecification of the second order.[7]

The econometrician Robert Engle was awarded the 2003 Nobel Memorial Prize in Economics for his studies on regression analysis in the presence of heteroscedasticity, which led to his formulation of the autoregressive conditional heteroscedasticity (ARCH) modeling technique.[8]

Definition


Consider the linear regression equation $y_i = x_i \beta + \varepsilon_i,\ i = 1, \ldots, N$, where the dependent random variable $y_i$ equals the deterministic variable $x_i$ times coefficient $\beta$ plus a random disturbance term $\varepsilon_i$ that has mean zero. The disturbances are homoscedastic if the variance of $\varepsilon_i$ is a constant $\sigma^2$; otherwise, they are heteroscedastic. In particular, the disturbances are heteroscedastic if the variance of $\varepsilon_i$ depends on $i$ or on the value of $x_i$. One way they might be heteroscedastic is if $\sigma_i^2 = x_i \sigma^2$ (an example of a scedastic function), so the variance is proportional to the value of $x_i$.

More generally, if the variance-covariance matrix of the disturbance $\varepsilon_i$ across $i$ has a nonconstant diagonal, the disturbance is heteroscedastic.[9] The matrices below are covariances when there are just three observations across time. The disturbance in matrix A is homoscedastic; this is the simple case where OLS is the best linear unbiased estimator. The disturbances in matrices B and C are heteroscedastic. In matrix B, the variance is time-varying, increasing steadily across time; in matrix C, the variance depends on the value of $x$. The disturbance in matrix D is homoscedastic because the diagonal variances are constant, even though the off-diagonal covariances are non-zero and ordinary least squares is inefficient for a different reason: serial correlation.

$$A = \sigma^2 \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad B = \sigma^2 \begin{bmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3 \end{bmatrix} \qquad C = \sigma^2 \begin{bmatrix} x_1 & 0 & 0 \\ 0 & x_2 & 0 \\ 0 & 0 & x_3 \end{bmatrix} \qquad D = \sigma^2 \begin{bmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{bmatrix}$$
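As a rough illustration of this definition, the following Python sketch (simulated data; the parameter values are arbitrary assumptions) draws disturbances that are homoscedastic and disturbances whose variance follows the scedastic function $\sigma_i^2 = x_i \sigma^2$, then compares their spread for small and large $x$:

```python
# Minimal sketch: homoscedastic vs. heteroscedastic disturbances with Var(eps_i) = x_i * sigma^2.
import numpy as np

rng = np.random.default_rng(0)
n, beta, sigma2 = 500, 2.0, 1.0
x = rng.uniform(1.0, 10.0, size=n)

eps_homo = rng.normal(0.0, np.sqrt(sigma2), size=n)        # constant variance
eps_hetero = rng.normal(0.0, np.sqrt(sigma2 * x), size=n)  # variance proportional to x

y_homo = x * beta + eps_homo
y_hetero = x * beta + eps_hetero

# Compare disturbance spread for small vs. large x: roughly equal in the first case,
# clearly larger at large x in the second.
small, large = x < 4, x > 7
print(eps_homo[small].var(), eps_homo[large].var())
print(eps_hetero[small].var(), eps_hetero[large].var())
```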

Examples


Heteroscedasticity often occurs when there is a large difference among the sizes of the observations.

A classic example of heteroscedasticity is that of income versus expenditure on meals. A wealthy person may eat inexpensive food sometimes and expensive food at other times. A poor person will almost always eat inexpensive food. Therefore, people with higher incomes exhibit greater variability in expenditures on food.

At a rocket launch, an observer measures the distance traveled by the rocket once per second. In the first couple of seconds, the measurements may be accurate to the nearest centimeter. After five minutes, the accuracy of the measurements may be good only to 100 m, because of the increased distance, atmospheric distortion, and a variety of other factors. So the measurements of distance may exhibit heteroscedasticity.

Consequences


One of the assumptions of the classical linear regression model is that there is no heteroscedasticity. Breaking this assumption means that the Gauss–Markov theorem does not apply, so OLS estimators are not the Best Linear Unbiased Estimators (BLUE) and their variance is not the lowest among all linear unbiased estimators. Heteroscedasticity does not cause ordinary least squares coefficient estimates to be biased, although it can cause ordinary least squares estimates of the variance (and, thus, standard errors) of the coefficients to be biased, possibly above or below the true or population variance. Thus, regression analysis using heteroscedastic data will still provide an unbiased estimate for the relationship between the predictor variable and the outcome, but standard errors and therefore inferences obtained from data analysis are suspect. Biased standard errors lead to biased inference, so results of hypothesis tests are possibly wrong. For example, if OLS is performed on a heteroscedastic data set, yielding biased standard error estimation, a researcher might fail to reject a null hypothesis at a given significance level, when that null hypothesis was actually uncharacteristic of the actual population (making a type II error).

Under certain assumptions, the OLS estimator has a normal asymptotic distribution when properly normalized and centered (even when the data does not come from a normal distribution). This result is used to justify using a normal distribution, or a chi square distribution (depending on how the test statistic is calculated), when conducting a hypothesis test. This holds even under heteroscedasticity. More precisely, the OLS estimator in the presence of heteroscedasticity is asymptotically normal, when properly normalized and centered, with a variance-covariance matrix that differs from the case of homoscedasticity. In 1980, White proposed a consistent estimator for the variance-covariance matrix of the asymptotic distribution of the OLS estimator.[2] This validates the use of hypothesis testing using OLS estimators and White's variance-covariance estimator under heteroscedasticity.
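For illustration only, the following statsmodels sketch (simulated heteroscedastic data; coefficients and the variance pattern are assumptions) fits OLS and reports both the conventional standard errors and White's heteroscedasticity-consistent (HC0) standard errors:

```python
# Sketch: OLS with classical vs. White (HC0) heteroscedasticity-consistent standard errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 400
x = rng.uniform(1.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5 * x)   # error standard deviation grows with x
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()                       # classical standard errors
robust = sm.OLS(y, X).fit(cov_type="HC0")      # White's (1980) covariance estimator

print(ols.bse)      # conventional standard errors (unreliable under heteroscedasticity)
print(robust.bse)   # heteroscedasticity-consistent standard errors
```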

Heteroscedasticity is also a major practical issue encountered in ANOVA problems.[10] The F test can still be used in some circumstances.[11]

However, it has been said that students in econometrics should not overreact to heteroscedasticity.[3] One author wrote, "unequal error variance is worth correcting only when the problem is severe."[12] Another cautioned that "heteroscedasticity has never been a reason to throw out an otherwise good model."[3][13] With the advent of heteroscedasticity-consistent standard errors, which allow inference without specifying the conditional second moment of the error term, testing conditional homoscedasticity is not as important as in the past.[6]

For any non-linear model (for instance Logit and Probit models), however, heteroscedasticity has more severe consequences: the maximum likelihood estimates (MLE) of the parameters will usually be biased, as well as inconsistent (unless the likelihood function is modified to correctly take into account the precise form of heteroscedasticity or the distribution is a member of the linear exponential family and the conditional expectation function is correctly specified).[14][15] Yet, in the context of binary choice models (Logit or Probit), heteroscedasticity will only result in a positive scaling effect on the asymptotic mean of the misspecified MLE (i.e. the model that ignores heteroscedasticity).[16] As a result, the predictions which are based on the misspecified MLE will remain correct. In addition, the misspecified Probit and Logit MLE will be asymptotically normally distributed which allows performing the usual significance tests (with the appropriate variance-covariance matrix). However, regarding the general hypothesis testing, as pointed out by Greene, "simply computing a robust covariance matrix for an otherwise inconsistent estimator does not give it redemption. Consequently, the virtue of a robust covariance matrix in this setting is unclear."[17]

Correction


There are several common corrections for heteroscedasticity. They are:

  • Apply a variance-stabilizing transformation to the data, e.g. taking logarithms. Non-logarithmized series that are growing exponentially often appear to have increasing variability as the series rises over time. The variability in percentage terms may, however, be rather stable.
  • Use a different specification for the model (different X variables, or perhaps non-linear transformations of the X variables).
  • Apply a weighted least squares estimation method, in which OLS is applied to transformed or weighted values of X and Y (a short sketch follows this list). The weights vary over observations, usually depending on the changing error variances. In one variation the weights are directly related to the magnitude of the dependent variable, and this corresponds to least squares percentage regression.[18]
  • Heteroscedasticity-consistent standard errors (HCSE), while still biased, improve upon OLS estimates.[2] HCSE is a consistent estimator of standard errors in regression models with heteroscedasticity. This method corrects for heteroscedasticity without altering the values of the coefficients. It may be superior to regular OLS because it corrects for heteroscedasticity when present, while if the data are homoscedastic the standard errors are equivalent to conventional standard errors estimated by OLS. Several modifications of the White method of computing heteroscedasticity-consistent standard errors have been proposed as corrections with superior finite-sample properties.
  • Wild bootstrapping can be used as a resampling method that respects the differences in the conditional variance of the error term. An alternative is resampling observations instead of errors. Note that resampling errors without regard to the values of the associated observations enforces homoskedasticity and thus yields incorrect inference.
  • Use MINQUE or even the customary sample-variance estimators computed separately within each independent sample, whose efficiency losses are not substantial when the number of observations per sample is large, especially for a small number of independent samples.[19]
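The weighted least squares option above can be sketched with statsmodels. In this minimal example (simulated data; the variance model Var(ε_i) = x_i is an assumption chosen for illustration), the weights are set to the inverse of the assumed variance:

```python
# Sketch of weighted least squares, assuming error variance (approximately) proportional to x.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(1.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, np.sqrt(x))   # Var(eps_i) = x_i
X = sm.add_constant(x)

wls = sm.WLS(y, X, weights=1.0 / x).fit()         # weight by inverse variance
ols = sm.OLS(y, X).fit()

print(ols.params, ols.bse)                        # unbiased but inefficient
print(wls.params, wls.bse)                        # smaller standard errors if the weights are right
```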

Testing

Figure: Absolute value of residuals for simulated first-order heteroscedastic data.

Residuals can be tested for homoscedasticity using the Breusch–Pagan test,[20] which performs an auxiliary regression of the squared residuals on the independent variables. From this auxiliary regression, the explained sum of squares is retained, divided by two, and then becomes the test statistic for a chi-squared distribution with the degrees of freedom equal to the number of independent variables.[21] The null hypothesis of this chi-squared test is homoscedasticity, and the alternative hypothesis would indicate heteroscedasticity. Since the Breusch–Pagan test is sensitive to departures from normality or small sample sizes, the Koenker–Bassett or 'generalized Breusch–Pagan' test is commonly used instead.[22][additional citation(s) needed] From the auxiliary regression, it retains the R-squared value which is then multiplied by the sample size, and then becomes the test statistic for a chi-squared distribution (and uses the same degrees of freedom). Although it is not necessary for the Koenker–Bassett test, the Breusch–Pagan test requires that the squared residuals also be divided by the residual sum of squares divided by the sample size.[22] Testing for groupwise heteroscedasticity can be done with the Goldfeld–Quandt test.[23]
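As an illustration (simulated data, not from the cited sources), the Breusch–Pagan test can be run on OLS residuals with statsmodels' het_breuschpagan helper, which performs the auxiliary regression of squared residuals on the regressors internally:

```python
# Sketch of the Breusch-Pagan test on OLS residuals using statsmodels.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
x = rng.uniform(1.0, 10.0, size=500)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3 * x)      # heteroscedastic errors
X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, X)
print(lm_stat, lm_pvalue)   # a small p-value rejects the null of homoscedasticity
```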

Due to the standard use of heteroskedasticity-consistent standard errors and the problem of pre-testing, econometricians nowadays rarely use tests for conditional heteroskedasticity.[6]

List of tests


Although tests for heteroscedasticity between groups can formally be considered as a special case of testing within regression models, some tests have structures specific to this case.

Generalisations


Homoscedastic distributions


Two or more normal distributions, $N(\mu_i, \Sigma_i)$, are both homoscedastic and lack serial correlation if they share the same diagonal in their covariance matrices and their off-diagonal entries are zero. Homoscedastic distributions are especially useful in deriving statistical pattern recognition and machine learning algorithms. One popular example of an algorithm that assumes homoscedasticity is Fisher's linear discriminant analysis. The concept of homoscedasticity can be applied to distributions on spheres.[27]

Multivariate data


The study of homoscedasticity and heteroscedasticity has been generalized to the multivariate case, which deals with the covariances of vector observations instead of the variance of scalar observations. One version of this is to use covariance matrices as the multivariate measure of dispersion. Several authors have considered tests in this context, for both regression and grouped-data situations.[28][29] Bartlett's test for heteroscedasticity between grouped data, used most commonly in the univariate case, has also been extended to the multivariate case, but a tractable solution exists only for two groups.[30] Approximations exist for more than two groups, and these are known as Box's M test.

from Grokipedia
Homoscedasticity and heteroscedasticity are fundamental concepts in statistics that describe the consistency of variance in the residuals or error terms of a model, most notably in linear regression analysis. Homoscedasticity refers to the condition where the variance of these residuals remains constant across all levels of the independent variables, ensuring equal spread of errors regardless of the predicted values. In contrast, heteroscedasticity occurs when the variance of the residuals is unequal or changes systematically, often increasing or decreasing with the magnitude of the independent variables or fitted values. These properties are essential assumptions underlying many parametric statistical tests and models.

The assumption of homoscedasticity is central to the validity of ordinary least squares (OLS) regression, as it guarantees that the estimators are not only unbiased but also the most efficient (with the lowest variance) among linear unbiased estimators. Violation through heteroscedasticity, however, leads to inefficient estimates, underestimated or overestimated standard errors, and unreliable p-values or confidence intervals, potentially resulting in incorrect inferences about relationships in the data. For instance, in cross-sectional economic data, heteroscedasticity might arise from larger errors in observations with higher values, such as income levels affecting consumption variability. This issue is particularly critical in fields such as biomedical research, where ignoring it can invalidate analyses of variance (ANOVA) or t-tests more severely than non-normality of residuals.

Detecting heteroscedasticity typically involves visual inspection of residual plots, where a fan-shaped or increasing spread indicates the problem, or formal statistical tests such as the Breusch-Pagan test, which regresses squared residuals on the independent variables to check for significance, or the White test for general forms that does not specify the heteroscedasticity structure. Remedies include transforming variables (e.g., using logarithms to stabilize variance), applying weighted least squares to downweight observations with larger errors, or using heteroscedasticity-robust standard errors that adjust inference without altering the model. These approaches ensure more reliable model diagnostics and interpretations, enhancing the robustness of statistical conclusions in regression-based studies.

Definitions

Homoscedasticity

Homoscedasticity refers to the property of a regression model where the variance of the error terms, or residuals, remains constant across all levels of the independent variables. This assumption ensures that the spread of residuals does not systematically increase or decrease with the predicted values, providing a stable measure of variability in the data. In the context of a model expressed as $Y = X\beta + \varepsilon$, homoscedasticity is mathematically defined by the condition that the variance of each error term is identical, i.e., $\operatorname{Var}(\varepsilon_i) = \sigma^2$ for all observations $i$, where $\sigma^2$ is a positive constant. This uniformity in error variance is a key component of the classical linear regression framework.

The term "homoscedasticity" was coined by Karl Pearson in 1905, derived from the Greek words homo (meaning "same") and skedasis (meaning "dispersion" or "scattering"). Pearson introduced it in his work on skew correlation and non-linear regression to describe arrays of data with equal scatter around their means.

As a foundational assumption in ordinary least squares (OLS) regression, homoscedasticity is essential for the OLS estimator to be efficient as well as unbiased, as established by the Gauss-Markov theorem, which identifies OLS as the best linear unbiased estimator (BLUE) under these conditions. Without it, while OLS remains unbiased, the estimator may not achieve minimum variance, potentially leading to inefficient inferences.

Heteroscedasticity

Heteroscedasticity occurs in statistical models, particularly regression models, when the variance of the error terms is not constant across all observations but instead varies, often in a systematic manner related to the levels of the independent variables. This phenomenon contrasts with the ideal of constant variance and can arise from inherent properties of the data-generating process, such as greater variability at higher predicted values or differing scales in subpopulations.

The term "heteroscedasticity" was likewise coined by Karl Pearson in 1905, derived from the Greek words hetero (meaning "different") and skedasis (meaning "dispersion" or "scattering"). Pearson introduced it in his work on skew correlation and non-linear regression to describe arrays of data with unequal scatter around their means.

Mathematically, heteroscedasticity is expressed as $\operatorname{Var}(\varepsilon_i \mid X_i) = \sigma_i^2$, where the conditional variance $\sigma_i^2$ is a function of the predictors $X_i$, commonly modeled as $\sigma_i^2 = \sigma^2 \cdot h(X_i)$ for some positive function $h$. Common forms include multiplicative heteroscedasticity, in which the variance is proportional to the square of the mean (e.g., $\sigma_i^2 \propto \mu_i^2$), often seen in models with multiplicative errors, and additive heteroscedasticity, where the variance includes a constant addition (e.g., $\sigma_i^2 = \sigma^2 + g(X_i)$).

This varying dispersion directly violates the homoscedasticity assumption of the Gauss-Markov theorem, which requires constant error variance for ordinary least squares estimators to be the best linear unbiased estimators in terms of minimum variance. As a result, while OLS remains unbiased under heteroscedasticity, it loses efficiency compared to estimators that account for the varying variances.

Examples

Univariate Cases

In univariate cases, homoscedasticity and heteroscedasticity can be illustrated using simple datasets consisting of one primary variable of interest, often conditioned on a grouping or ordering variable to demonstrate variance patterns without invoking predictive modeling. These examples help build intuition by showing how the spread of points remains constant or changes systematically.

A classic illustration of homoscedasticity involves observations drawn from a normal distribution with fixed variance, such as the heights of adults within a homogeneous group, like young adults of similar socioeconomic background. In such cases, the spread of heights is consistent across subgroups defined by minor categorizations, such as small age ranges (e.g., 20-25 years vs. 26-30 years), reflecting a constant variance that does not fan out or contract. Histograms or boxplots of these height measurements typically display similar widths across bins, indicating uniform dispersion. To quantify this, sample variances can be calculated for subgroups; for instance, if the variance in heights for the 20-25 age group is approximately 25 cm² and for the 26-30 age group is 24 cm², the near-equality supports homoscedasticity.

In contrast, heteroscedasticity is evident in datasets where the variance increases (or decreases) with levels of an ordering variable, such as income across different age groups. A representative example is income plotted against age, where younger age groups (e.g., 20-30 years) show a narrow spread of incomes around a low mean, while older groups (e.g., 50-60 years) exhibit a wider spread due to greater variability in career outcomes and earnings potential. Visually, a scatterplot of these data reveals a "fan" shape, with points clustering tightly at low ages and spreading outward at higher ages, unlike the parallel bands seen in homoscedastic plots. Calculating sample variances across age subgroups confirms this; for example, the variance might be $5000² for the 20-30 group but rise to $15000² for the 50-60 group, demonstrating increasing dispersion.

These univariate illustrations highlight the core distinction in variance behavior and extend naturally to more complex scenarios like regression models, where similar patterns appear in residual spreads.
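A small numerical sketch of these univariate checks (the group means and spreads below are invented purely for illustration) compares sample variances across subgroups:

```python
# Sketch: compare sample variances across subgroups, the simple univariate check described above.
import numpy as np

rng = np.random.default_rng(4)

# Homoscedastic-like: heights (cm) with the same spread in both age bands
heights_20_25 = rng.normal(170, 5, size=200)
heights_26_30 = rng.normal(171, 5, size=200)
print(heights_20_25.var(ddof=1), heights_26_30.var(ddof=1))   # roughly equal

# Heteroscedastic-like: incomes with spread growing with age
income_20_30 = rng.normal(30_000, 5_000, size=200)
income_50_60 = rng.normal(60_000, 15_000, size=200)
print(income_20_30.var(ddof=1), income_50_60.var(ddof=1))     # markedly different
```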

Regression Contexts

In the context of linear regression models, homoscedasticity plays a crucial role as one of the core assumptions underlying ordinary least squares (OLS) estimation. Consider the model $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $Y_i$ is the dependent variable, $X_i$ is the independent variable, and $\varepsilon_i$ represents the error term for the $i$-th observation. Homoscedasticity requires that the variance of the error terms, $\operatorname{Var}(\varepsilon_i)$, remains constant across all levels of the predictor $X_i$, ensuring that the model's predictions have consistent reliability regardless of the value of $X$. Violations of this assumption, known as heteroscedasticity, manifest in residual plots where the spread of residuals widens or narrows systematically with fitted values, indicating unequal error variances that can distort the interpretation of the regression line.

Residuals serve as the primary diagnostic tool for assessing homoscedasticity, defined as the differences between observed and predicted values: $\hat{\varepsilon}_i = Y_i - \hat{Y}_i$, where $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$ is the fitted value from the OLS estimates. Under homoscedasticity, these residuals should exhibit a constant variance, appearing as a random, even cluster around the zero line in a residuals-versus-fitted-values plot, with no discernible pattern of increasing or decreasing spread. In contrast, heteroscedastic residuals display a funnel-shaped pattern, where the variability expands (or contracts) as fitted values increase, signaling that the assumption is violated and prompting further investigation.

An illustrative example of heteroscedasticity arises in regressions of wages on years of work experience, a common application in labor economics. In such models, residuals often show increasing variance at higher levels of experience, as more seasoned workers face greater heterogeneity in earnings due to factors like career paths, industry shifts, or unmeasured skills, leading to wider spreads in the error terms for higher $X_i$ values. This pattern contrasts with homoscedastic scenarios, where wage deviations remain uniformly scattered across all experience levels; a sketch simulating this setup follows below.

Heteroscedasticity is particularly prevalent in cross-sectional economic data, such as regressions analyzing firm profits against size or revenue. For instance, larger firms typically exhibit more variable profits due to diverse revenue streams, operational complexities, and exposure to market fluctuations, resulting in error variances that grow with firm scale. This non-constant variance challenges the reliability of OLS estimates in such datasets, where smaller firms might show tightly clustered residuals while larger ones display greater dispersion.
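The wage-experience illustration can be mimicked with simulated data (the coefficients and variance function below are assumptions chosen for illustration only):

```python
# Sketch: fit OLS on simulated wage-experience data and compare residual spread
# at low vs. high experience.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
exp_years = rng.uniform(0.0, 30.0, size=600)
wage = 10.0 + 0.8 * exp_years + rng.normal(0.0, 1.0 + 0.3 * exp_years)  # spread grows with experience

X = sm.add_constant(exp_years)
res = sm.OLS(wage, X).fit()
resid = res.resid

print(resid[exp_years < 10].std())   # tight residuals for junior workers
print(resid[exp_years > 20].std())   # much wider residuals for senior workers
```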

Consequences

Effects on Estimation

In the presence of heteroscedasticity, ordinary least squares (OLS) estimators of regression parameters remain unbiased, meaning that the expected value of the estimator equals the true value, $E[\hat{\beta}] = \beta$. However, these estimators lose their efficiency, exhibiting larger variance than the minimum variance achievable among linear unbiased estimators. This inefficiency arises because the assumption of constant error variance, required for the Gauss-Markov theorem to establish OLS as the best linear unbiased estimator (BLUE), is violated.

The variance-covariance matrix of the OLS estimator under homoscedasticity is given by $\operatorname{Var}(\hat{\beta}) = \sigma^2 (X'X)^{-1}$, where $\sigma^2$ is the constant error variance and $X$ is the design matrix. Under heteroscedasticity, this formula no longer holds; the true structure involves a covariance matrix $\Omega$ with varying $\sigma_i^2$ on the diagonal, leading to $\operatorname{Var}(\hat{\beta}) = (X'X)^{-1} X' \Omega X (X'X)^{-1}$. Consequently, the standard errors computed using the homoscedastic formula are incorrect, often underestimating the true variability of the estimates.

This violation implies that the BLUE property of OLS fails, as the estimator no longer achieves the minimum variance among all linear unbiased estimators. If the form of heteroscedasticity were known, generalized least squares (GLS) would provide a more efficient estimator by weighting observations according to their variances, yielding smaller variances for $\hat{\beta}$. In practice, this efficiency loss means that OLS does not exploit the varying precision of observations optimally.

Empirically, heteroscedasticity results in wider confidence intervals for predictions, particularly in regions of the predictor space where variances are high, reducing the reliability of interval estimates in those areas. For instance, in economic models where variance increases with income levels, predictions for higher-income groups will have inflated uncertainty, potentially misleading policy interpretations.
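The two covariance formulas above can be compared numerically. The sketch below (simulated data) computes the homoscedastic formula $\sigma^2 (X'X)^{-1}$ and a White-style sandwich estimate $(X'X)^{-1} X'\hat{\Omega} X (X'X)^{-1}$, with $\hat{\Omega}$ taken as the diagonal of squared OLS residuals, in plain NumPy:

```python
# Sketch: homoscedastic vs. sandwich covariance estimates for OLS coefficients.
import numpy as np

rng = np.random.default_rng(6)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5 * x)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
XtX_inv = np.linalg.inv(X.T @ X)

# Homoscedastic formula (misleading here, since the errors are heteroscedastic)
sigma2_hat = resid @ resid / (n - X.shape[1])
cov_homo = sigma2_hat * XtX_inv

# Sandwich formula with Omega_hat = diag(e_i^2)
meat = X.T @ (X * resid[:, None] ** 2)
cov_sandwich = XtX_inv @ meat @ XtX_inv

print(np.sqrt(np.diag(cov_homo)))       # classical standard errors
print(np.sqrt(np.diag(cov_sandwich)))   # heteroscedasticity-consistent standard errors
```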

Effects on Inference

Heteroscedasticity in regression models leads to invalid standard errors for the estimated coefficients, as the usual ordinary least squares (OLS) formula assumes constant error variance, resulting in either underestimation or overestimation of the true variability. This bias in standard errors distorts t-statistics and associated p-values, rendering hypothesis tests about individual coefficients unreliable, since the t-distribution no longer applies under the violated assumption. For instance, when error variance is lower than assumed in certain regions of the data, standard errors may be underestimated, inflating t-statistics and producing misleadingly low p-values.

The distortion in standard errors increases the risk of Type I errors (falsely rejecting null hypotheses) in regions of low error variance, where tests appear more significant than they are, and Type II errors (failing to reject false nulls) in high-variance regions, where tests lack power due to overestimated variability. Overall, these error rate imbalances mean that the nominal significance levels (e.g., 5%) do not reflect the actual probability of incorrect inferences, compromising the validity of statistical decisions based on the regression.

Confidence intervals for coefficients become unreliable under heteroscedasticity, as they rely on the biased standard errors; intervals may be too narrow in low-variance areas, falsely suggesting precise estimates, or too wide elsewhere, obscuring true effects and affecting assessments of predictor significance. Similarly, the F-test for overall model fit is invalidated, since its derivation assumes homoscedastic errors, leading to incorrect conclusions about the joint significance of predictors.

Detection

Graphical Methods

Graphical methods provide an intuitive, preliminary approach to detecting heteroscedasticity by visually inspecting the residuals from a regression model, often revealing patterns that suggest non-constant variance before applying formal tests. These plots focus on the spread and distribution of residuals, which are the differences between observed and predicted values, and are essential for identifying violations of the homoscedasticity assumption in linear regression.

The residuals versus fitted values plot is a fundamental diagnostic tool, displaying residuals on the y-axis against the model's fitted (predicted) values on the x-axis. Under homoscedasticity, the points should form a horizontal band around the zero line with constant width, indicating equal variance across all levels of the fitted values; deviations, such as a "fan" or cone-shaped pattern where the spread widens as fitted values increase, signal increasing heteroscedasticity. This visual pattern arises because heteroscedasticity often correlates with the magnitude of the response variable, making the plot sensitive to variance changes tied to predicted outcomes.

Similarly, the residuals versus predictor plot examines residuals against predictor variables (X) on the x-axis, helping to pinpoint whether heteroscedasticity is associated with specific covariates. A uniform horizontal band suggests constant variance independent of the predictor, whereas a fanning or narrowing pattern indicates that the error variance varies with levels of that particular X variable, such as in cases where variance increases with higher values of an income predictor. This plot is particularly useful when multiple predictors are involved, allowing targeted inspection for variable-specific effects on variance.

The scale-location plot, also known as the spread-location plot, enhances detection by plotting the square root of the absolute residuals against the fitted values, which helps stabilize the variance scale and makes non-constant patterns more apparent. In this transformation, a straight horizontal line fitted through the points indicates homoscedasticity, while a curving or sloping trend reveals heteroscedasticity more clearly than the untransformed residuals plot, as the square-root transformation reduces the influence of extreme residuals and equalizes the visual impact of variance changes. This method is especially effective for datasets with outliers or skewed residual distributions, providing a clearer view of underlying variance heterogeneity.

Quantile-quantile (Q-Q) plots, while primarily designed to assess the normality assumption by comparing residual quantiles to those of a normal distribution, offer only secondary and limited insight into heteroscedasticity. Deviations from linearity in a Q-Q plot may partially correlate with variance issues if heteroscedasticity distorts the tail behavior, but the plot is not reliable for direct detection, as constant-variance violations can occur without markedly affecting quantile alignments. For this reason, Q-Q plots should be supplemented with dedicated variance-focused graphics rather than relied upon in isolation for heteroscedasticity diagnosis.
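A minimal matplotlib sketch (simulated fan-shaped errors, illustrative parameter values) produces the two main plots described above, residuals versus fitted values and the scale-location plot:

```python
# Sketch: residuals-vs-fitted and scale-location diagnostic plots.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(1.0, 10.0, size=400)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.4 * x)   # fan-shaped errors
res = sm.OLS(y, sm.add_constant(x)).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(res.fittedvalues, res.resid, s=10)
ax1.axhline(0.0, color="grey")
ax1.set(xlabel="Fitted values", ylabel="Residuals", title="Residuals vs. fitted")

ax2.scatter(res.fittedvalues, np.sqrt(np.abs(res.resid)), s=10)
ax2.set(xlabel="Fitted values", ylabel="sqrt(|residuals|)", title="Scale-location")
plt.tight_layout()
plt.show()
```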

Formal Tests

Formal statistical tests for heteroscedasticity provide rigorous, quantitative methods to assess whether the variance of residuals in a regression model is constant, offering p-values to support or reject the null hypothesis. These tests are typically applied after fitting an ordinary least squares (OLS) model and examining the residuals. The null hypothesis $H_0$ posits homoscedasticity, meaning the error variance is constant across all levels of the independent variables, while the alternative $H_a$ indicates heteroscedasticity, where the variance varies.

The Breusch-Pagan test, proposed in 1979, is a Lagrange multiplier (LM) test that assumes the heteroscedasticity follows a specific functional form related to the predictors. It involves first estimating the OLS model to obtain residuals $\hat{e}_i$, then regressing the squared residuals $\hat{e}_i^2$ on the independent variables $X$. The test statistic is computed as $LM = n R^2$, where $n$ is the sample size and $R^2$ is the coefficient of determination from the auxiliary regression; under $H_0$, this statistic asymptotically follows a $\chi^2$ distribution with $k$ degrees of freedom, where $k$ is the number of predictors in the auxiliary regression. Rejection of $H_0$ at a chosen significance level suggests heteroscedasticity.

The White test, introduced in 1980, offers a more general approach that does not presuppose a particular form of heteroscedasticity, making it robust to unknown variance structures. Similar to the Breusch-Pagan test, it begins with OLS residuals, but the auxiliary regression includes the original predictors $X$, their squares $X^2$, and all cross-products among the predictors. The test statistic is again an LM statistic, $LM = n R^2$, distributed asymptotically as $\chi^2$ with degrees of freedom equal to the number of terms in the auxiliary regression minus one. This broader specification detects a wider range of heteroscedasticity patterns but may suffer from reduced power in small samples due to the increased number of parameters.

The Goldfeld-Quandt test, developed in 1965, is a parametric F-test suited for cases where heteroscedasticity is suspected to increase monotonically with a specific predictor, often after ordering the data by that variable. The procedure splits the ordered sample into three parts, discarding the middle portion to separate low and high values of the predictor, then fits separate OLS models to the first (low) and last (high) subsets. The test compares the residual sum of squares from the high-variance subset to that from the low-variance subset via an F-statistic, $F = \frac{RSS_H / (n_H - p)}{RSS_L / (n_L - p)}$, where subscripts $H$ and $L$ denote the high and low groups, $n$ is the subset size, and $p$ is the number of parameters; under $H_0$, this follows an F-distribution with $(n_H - p,\ n_L - p)$ degrees of freedom.

Despite their utility, these formal tests share key limitations rooted in their statistical foundations. They generally assume normality of the error terms for the asymptotic distributions to hold exactly, and violations can lead to size distortions or reduced reliability. Additionally, their power to detect heteroscedasticity varies with sample size, often being low in small samples where subtle variance changes may go undetected, while large samples can yield significant results even for minor deviations.
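For illustration, statsmodels also provides the White and Goldfeld-Quandt tests (het_white and het_goldfeldquandt); the sketch below runs both on simulated heteroscedastic data with assumed coefficients:

```python
# Sketch: White and Goldfeld-Quandt heteroscedasticity tests with statsmodels.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white, het_goldfeldquandt

rng = np.random.default_rng(8)
x1 = rng.uniform(1.0, 10.0, size=500)
x2 = rng.normal(size=500)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(0.0, 0.3 * x1)   # variance driven by x1
X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(res.resid, X)
print("White LM p-value:", lm_pvalue)

# Goldfeld-Quandt: sort by the suspected variance driver (column 1 = x1) and compare subset variances
gq_stat, gq_pvalue, _ = het_goldfeldquandt(y, X, idx=1)
print("Goldfeld-Quandt p-value:", gq_pvalue)
```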

Corrections

Transformations

Transformations of the response variable are commonly applied to address heteroscedasticity by stabilizing the variance, thereby approximating homoscedasticity in regression models. These methods alter the scale of the data to make the variance of the transformed variable more constant across levels of the predictors, often based on the observed form of heteroscedasticity identified through residual diagnostics.

The logarithmic transformation, log(Y), is particularly effective for multiplicative heteroscedasticity where the variance of the original response is proportional to the square of the mean, Var(Y) ∝ [E(Y)]². In such cases, the transformation yields an approximately constant variance for the logged response, Var(log(Y)) ≈ constant, which is useful for data exhibiting exponential growth or positive skew.

For heteroscedasticity where the variance is proportional to the mean, Var(Y) ∝ E(Y), as often seen in count data following a Poisson distribution, the transformation √Y stabilizes the variance to approximately Var(√Y) ≈ 1/4. This approach reduces the wedge-shaped pattern in residual plots and is a variance-stabilizing method originally proposed for the analysis of variance with Poisson-like variability.

The Box-Cox family of power transformations provides a more general framework, defined as Y^(λ) = (Y^λ - 1)/λ for λ ≠ 0 and log(Y) for λ = 0, where the parameter λ is selected to minimize the residual variance in the transformed model. Introduced by Box and Cox, this method allows flexible adjustment to achieve both normality and homoscedasticity by estimating λ via maximum likelihood, encompassing special cases like the log (λ=0) and square root (λ=0.5) transformations.

These transformations are typically applied when residual plots from initial regression diagnostics reveal variance increasing with the fitted values or predictors, indicating heteroscedasticity. They can preserve interpretability, especially the log transformation in economic models where coefficients represent elasticities, but they require positive data and careful back-transformation for predictions.
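A short sketch of these transformations (simulated positive data with multiplicative errors, illustrative parameters) uses NumPy for the log and square-root forms and scipy.stats.boxcox to estimate λ by maximum likelihood:

```python
# Sketch: variance-stabilizing transformations (log, square root, Box-Cox).
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
mu = rng.uniform(5.0, 50.0, size=1000)
y = mu * np.exp(rng.normal(0.0, 0.3, size=1000))   # multiplicative errors: Var(y) grows with the mean

y_log = np.log(y)                                   # stabilizes multiplicative heteroscedasticity
y_sqrt = np.sqrt(y)                                 # appropriate when Var(y) is proportional to E(y)
y_bc, lam = stats.boxcox(y)                         # general power transformation, lambda by ML

print("estimated Box-Cox lambda:", lam)             # near 0 here, i.e., close to the log transform
```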

Weighted and Robust Methods

When heteroscedasticity is present and its form is known or can be reasonably estimated, weighted least squares (WLS) provides an efficient estimation method by assigning weights inversely proportional to the error variances. Introduced by Aitken, WLS minimizes the weighted sum of squared residuals, $\hat{\beta}_{WLS} = \arg\min_{\beta} \sum_{i=1}^n w_i (y_i - \mathbf{x}_i' \beta)^2$, where $w_i = 1/\sigma_i^2$ and $\sigma_i^2$ is the variance of the $i$-th error term. To implement WLS, the variances $\sigma_i^2$ must first be estimated, often through methods such as grouped regression, where observations are sorted by a suspected heteroscedasticity driver (e.g., fitted values) and variances are computed within groups.

Feasible generalized least squares (FGLS) extends WLS to cases where the exact variance structure is unknown but can be approximated iteratively. FGLS begins with an initial ordinary least squares (OLS) fit to obtain residuals $\hat{\epsilon}_i$, from which preliminary variance estimates $\hat{\sigma}_i^2$ (e.g., $\hat{\sigma}_i^2 = \hat{\epsilon}_i^2 / (1 - h_{ii})$, where $h_{ii}$ is the $i$-th leverage) are derived to construct weights; a weighted regression is then performed, and the process iterates until convergence. This approach yields asymptotically efficient estimates under correct specification of the variance form but can be inefficient or biased if the heteroscedasticity model is misspecified.

For unknown heteroscedasticity forms, heteroscedasticity-consistent (HC) standard errors adjust inference without refitting the model, preserving OLS point estimates while correcting the estimated covariance matrix. White's seminal estimator computes the variance of the OLS coefficients as $\widehat{\operatorname{Var}}(\hat{\beta}_{OLS}) = (X'X)^{-1} \left( \sum_{i=1}^n \mathbf{x}_i \hat{\epsilon}_i^2 \mathbf{x}_i' \right) (X'X)^{-1}$, which remains consistent under arbitrary forms of heteroscedasticity.
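A simple two-step FGLS sketch is shown below; the exponential variance model Var(ε_i) = exp(γ₀ + γ₁x_i) is an assumption chosen for illustration (one common textbook recipe, not the only one):

```python
# Sketch: two-step feasible GLS with a log-linear variance model estimated from OLS residuals.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 600
x = rng.uniform(1.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, np.exp(0.2 * x))   # error s.d. grows exponentially with x
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()                               # step 1: OLS residuals
aux = sm.OLS(np.log(ols.resid ** 2), X).fit()          # step 2: model log(e_i^2) on X
sigma2_hat = np.exp(aux.fittedvalues)                  # fitted variance function

fgls = sm.WLS(y, X, weights=1.0 / sigma2_hat).fit()    # step 3: weighted regression
print(ols.params, ols.bse)
print(fgls.params, fgls.bse)                           # typically a much tighter slope estimate
```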