Regression validation
In statistics, regression validation is the process of deciding whether the numerical results quantifying hypothesized relationships between variables, obtained from regression analysis, are acceptable as descriptions of the data. The validation process can involve analyzing the goodness of fit of the regression, analyzing whether the regression residuals are random, and checking whether the model's predictive performance deteriorates substantially when applied to data that were not used in model estimation.
Goodness of fit
One measure of goodness of fit is the coefficient of determination, often denoted R2. In ordinary least squares with an intercept, it ranges between 0 and 1. However, an R2 close to 1 does not guarantee that the model fits the data well. For example, if the functional form of the model does not match the data, R2 can be high despite a poor model fit. Anscombe's quartet consists of four example data sets with similarly high R2 values, but data that sometimes clearly do not fit the regression line; the data sets include outliers, high-leverage points, or non-linearities.
One problem with the R2 as a measure of model validity is that it can always be increased by adding more variables into the model, except in the unlikely event that the additional variables are exactly uncorrelated with the dependent variable in the data sample being used. This problem can be avoided by doing an F-test of the statistical significance of the increase in the R2, or by instead using the adjusted R2.
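The effect can be seen in a minimal Python sketch using statsmodels on simulated data (the variables x1, x2 and the data set are illustrative, not taken from any source above): adding an irrelevant predictor nudges R2 upward, while the adjusted R2 and an F-test of the increase do not reward it.
```python
# Sketch: R-squared vs adjusted R-squared and the F-test for added variables
# (simulated data; all names and values are illustrative).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                 # a genuinely irrelevant predictor
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

small = sm.OLS(y, sm.add_constant(x1)).fit()
large = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# R-squared can only go up when x2 is added; adjusted R-squared may drop.
print(small.rsquared, large.rsquared)
print(small.rsquared_adj, large.rsquared_adj)

# F-test of whether the increase in R-squared is statistically significant.
f_stat, p_value, df_diff = large.compare_f_test(small)
print(f_stat, p_value)
```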
Analysis of residuals
The residuals from a fitted model are the differences between the responses observed at each combination of values of the explanatory variables and the corresponding prediction of the response computed using the regression function. Mathematically, the definition of the residual for the ith observation in the data set is written
e_i = y_i − f(x_i; β̂),
with y_i denoting the ith response in the data set and x_i the vector of explanatory variables, each set at the corresponding values found in the ith observation in the data set.
If the model fit to the data were correct, the residuals would approximate the random errors that make the relationship between the explanatory variables and the response variable a statistical relationship. Therefore, if the residuals appear to behave randomly, it suggests that the model fits the data well. On the other hand, if non-random structure is evident in the residuals, it is a clear sign that the model fits the data poorly. The next section details the types of plots to use to test different aspects of a model and gives the correct interpretations of different results that could be observed for each type of plot.
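As a small illustration, the residuals can be computed directly from their definition; the following Python sketch uses statsmodels on simulated data, with all variable names chosen for illustration only.
```python
# Sketch: residuals as observed minus fitted responses (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 0.5 * x + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(x)).fit()
residuals = y - fit.fittedvalues          # e_i = y_i - f(x_i; beta_hat)
assert np.allclose(residuals, fit.resid)  # matches statsmodels' stored residuals
```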
Graphical analysis of residuals
A basic, though not quantitatively precise, way to check for problems that render a model inadequate is to conduct a visual examination of the residuals (the mispredictions of the data used in quantifying the model) to look for obvious deviations from randomness. If a visual examination suggests, for example, the possible presence of heteroscedasticity (a relationship between the variance of the model errors and the size of an independent variable's observations), then statistical tests can be performed to confirm or reject this hunch; if it is confirmed, different modeling procedures are called for.
Different types of plots of the residuals from a fitted model provide information on the adequacy of different aspects of the model.
- sufficiency of the functional part of the model: scatter plots of residuals versus predictors
- non-constant variation across the data: scatter plots of residuals versus predictors; for data collected over time, also plots of residuals against time
- drift in the errors (data collected over time): run charts of the response and errors versus time
- independence of errors: lag plot
- normality of errors: histogram and normal probability plot
Graphical methods have an advantage over numerical methods for model validation because they readily illustrate a broad range of complex aspects of the relationship between the model and the data.
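The following Python sketch draws several of the plots listed above with matplotlib and statsmodels for a simulated, time-ordered data set; the data and variable names are illustrative assumptions, not part of the sources cited here.
```python
# Sketch: residual plots for a simulated, time-ordered data set.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 150
t = np.arange(n)                          # observation order / time index
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
fit = sm.OLS(y, sm.add_constant(x)).fit()
e = fit.resid

fig, ax = plt.subplots(2, 2, figsize=(9, 7))
ax[0, 0].scatter(x, e)                    # residuals vs predictor (functional form, variance)
ax[0, 0].set_title("Residuals vs predictor")
ax[0, 1].plot(t, e)                       # run chart of residuals vs time (drift)
ax[0, 1].set_title("Residuals vs time")
ax[1, 0].scatter(e[:-1], e[1:])           # lag plot (independence of errors)
ax[1, 0].set_title("Lag plot of residuals")
sm.qqplot(e, line="45", ax=ax[1, 1])      # normal probability plot (normality)
ax[1, 1].set_title("Normal Q-Q plot")
plt.tight_layout()
plt.show()
```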
Quantitative analysis of residuals
Numerical methods also play an important role in model validation. For example, the lack-of-fit test for assessing the correctness of the functional part of the model can aid in interpreting a borderline residual plot. One common situation when numerical validation methods take precedence over graphical methods is when the number of parameters being estimated is relatively close to the size of the data set. In this situation residual plots are often difficult to interpret due to constraints on the residuals imposed by the estimation of the unknown parameters. One area in which this typically happens is in optimization applications using designed experiments. Logistic regression with binary data is another area in which graphical residual analysis can be difficult.
Serial correlation of the residuals can indicate model misspecification, and can be checked for with the Durbin–Watson statistic. The problem of heteroskedasticity can be checked for in any of several ways.
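As a sketch of how such numerical checks might be run, the following Python code applies the Durbin–Watson statistic and, as one of the several possible heteroskedasticity checks, the Breusch–Pagan test from statsmodels to a simulated fit (the data and names are illustrative).
```python
# Sketch: numerical residual checks with statsmodels (simulated data).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
x = rng.normal(size=120)
y = 1.0 + 2.0 * x + rng.normal(size=120)
fit = sm.OLS(y, sm.add_constant(x)).fit()

dw = durbin_watson(fit.resid)             # values near 2: no first-order serial correlation
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(fit.resid, fit.model.exog)
print(dw, lm_pval)                        # small lm_pval would suggest heteroskedasticity
```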
Out-of-sample evaluation
Cross-validation is the process of assessing how the results of a statistical analysis will generalize to an independent data set. If the model has been estimated over some, but not all, of the available data, then the model using the estimated parameters can be used to predict the held-back data. If, for example, the out-of-sample mean squared error, also known as the mean squared prediction error, is substantially higher than the in-sample mean square error, this is a sign of deficiency in the model.
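A minimal Python sketch of this idea, holding back part of a simulated sample and comparing in-sample with out-of-sample mean squared error (all data and variable names are illustrative):
```python
# Sketch: in-sample vs out-of-sample mean squared error (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(size=300)
y = 1.0 + 2.0 * x + rng.normal(size=300)

X = sm.add_constant(x)
train, test = slice(0, 200), slice(200, 300)   # estimate on the first 200, hold back 100

fit = sm.OLS(y[train], X[train]).fit()
mse_in = np.mean(fit.resid ** 2)
mse_out = np.mean((y[test] - fit.predict(X[test])) ** 2)
print(mse_in, mse_out)   # a much larger out-of-sample MSE signals a deficient model
```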
A development in medical statistics is the use of out-of-sample cross-validation techniques in meta-analysis. It forms the basis of the validation statistic, Vn, which is used to test the statistical validity of meta-analysis summary estimates. Essentially it measures a type of normalized prediction error, and its distribution is a linear combination of χ² random variables, each with one degree of freedom.[1]
References
- ^ Willis BH, Riley RD (2017). "Measuring the statistical validity of summary meta-analysis and meta-regression results for use in clinical practice". Statistics in Medicine. 36 (21): 3283–3301. doi:10.1002/sim.7372. PMC 5575530. PMID 28620945.
Further reading
- Arboretti Giancristofaro, R.; Salmaso, L. (2003), "Model performance analysis and model validation in logistic regression", Statistica, 63: 375–396
- Kmenta, Jan (1986), Elements of Econometrics (Second ed.), Macmillan, pp. 593–600; republished in 1997 by University of Michigan Press
Core Assumptions in Regression
Linearity Assumption
In linear regression models, the linearity assumption requires that the expected value of the response variable is a linear function of the predictor variables. This principle underlies both simple linear regression, modeled as y = β_0 + β_1 x + ε, and multiple linear regression, expressed as y = β_0 + β_1 x_1 + β_2 x_2 + ⋯ + β_p x_p + ε. The assumption encompasses additivity, where the effects of predictors on the response are independent and combine linearly, without interactions or curvature influencing the slopes.[4][5][6]
To check the linearity assumption, diagnostic plots are commonly used. Scatter plots of the observed response against each predictor provide an initial visual assessment of linear trends. Residuals plotted against fitted values or against individual predictors should exhibit a random scatter around zero with no discernible patterns, such as bows or curves, which would signal nonlinearity. In multiple regression settings, component-plus-residual plots (or partial residual plots) help evaluate the linear contribution of each predictor after accounting for the others.[6][5][6]
Violation of the linearity assumption leads to biased coefficient estimates, diminished predictive accuracy, and compromised statistical inference, including unreliable p-values for hypothesis tests. Nonlinear patterns can cause systematic errors in predictions, especially when extrapolating beyond the range of observed data.[5][6]
Remedies for nonlinearity include augmenting the model with polynomial terms, such as a quadratic component β_2 x², to model curvature. Transformations of variables, including logarithmic (e.g., log y or log x) or square-root functions, can often restore linearity by stabilizing exponential or skewed relationships. For more pronounced nonlinearity, nonlinear regression techniques may be required instead of forcing a linear form.[5][6][5]
As an example, consider the simple linear model y = β_0 + β_1 x + ε. Linearity is evaluated by plotting residuals against fitted values; a pattern-free cloud of points centered on the zero line affirms the assumption, while any systematic curvature suggests refitting with polynomials or transformations.[6]
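A minimal Python sketch of this check, using statsmodels on simulated data with illustrative variable names: a straight-line fit to a curved relationship leaves a bowed residual plot, and adding a quadratic term removes the pattern.
```python
# Sketch: detecting nonlinearity in residuals and adding a quadratic term.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(size=200)   # true relationship is curved

linear = sm.OLS(y, sm.add_constant(x)).fit()
quadratic = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()

fig, ax = plt.subplots(1, 2, figsize=(9, 4), sharey=True)
ax[0].scatter(linear.fittedvalues, linear.resid)         # systematic bow: linearity violated
ax[0].set_title("Straight-line fit")
ax[1].scatter(quadratic.fittedvalues, quadratic.resid)   # random scatter around zero
ax[1].set_title("Fit with added x^2 term")
for a in ax:
    a.axhline(0.0, color="grey")
plt.show()
```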
Independence Assumption
In linear regression models, the independence assumption requires that the error terms for different observations are uncorrelated, formally expressed as Cov(ε_i, ε_j) = 0 for all i ≠ j.[7] This assumption ensures that the residuals do not exhibit systematic patterns of dependence, allowing the ordinary least squares (OLS) estimator to produce unbiased and efficient inferences under the Gauss-Markov theorem. Violations of this assumption arise from various data structures, including time series autocorrelation, where errors in sequential observations are positively or negatively correlated due to temporal trends; spatial dependence, in which nearby geographic units influence each other, leading to correlated residuals; and clustered sampling, such as multi-center studies in which observations within the same group (e.g., hospitals or schools) share unmodeled similarities quantified by an intraclass correlation coefficient (ICC) greater than zero.[8][9][10]
A primary method for detecting violations, particularly first-order autocorrelation in time-ordered data, is the Durbin-Watson test, introduced by Durbin and Watson. The test statistic is calculated as
DW = Σ_{t=2}^{n} (e_t − e_{t−1})² / Σ_{t=1}^{n} e_t²,
where the e_t are the OLS residuals and n is the number of observations.[8] The DW statistic ranges from 0 to 4, with a value near 2 indicating no first-order autocorrelation; values below 2 suggest positive autocorrelation (errors tend to persist), while values above 2 indicate negative autocorrelation (errors alternate in sign).[8] Critical values d_L and d_U from significance tables are used for hypothesis testing: if DW < d_L, reject the null hypothesis of no autocorrelation; if DW > d_U, fail to reject; otherwise, the result is inconclusive.[8]
Violating the independence assumption leads to underestimated standard errors of the coefficient estimates, as the model fails to account for the reduced effective sample size due to dependence.[11] This underestimation inflates t-statistics, resulting in inflated Type I error rates for significance tests (potentially 30% or higher at moderate ICC levels such as 0.3) and poor coverage of confidence intervals (for example, dropping to 71% at ICC = 0.5).[10][11] Overall, these issues render hypothesis tests unreliable and bias inferences about predictor effects.[7]
Remedies for dependence include generalized least squares (GLS), which transforms the model to account for the correlation structure in the errors, yielding efficient estimates.[8] For time series autocorrelation, adding lagged dependent or independent variables can model the serial dependence explicitly, as in autoregressive (AR) specifications.[8] For clustered or hierarchical data, mixed-effects models incorporate random effects to capture group-level variation, restoring valid inference.[10]
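The Durbin–Watson statistic can be computed directly from its definition and compared with the statsmodels implementation; the following Python sketch does so on simulated data with AR(1) errors (all names and parameter values are illustrative).
```python
# Sketch: Durbin-Watson statistic from its definition, on AR(1) errors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(6)
n = 200
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):                  # positively autocorrelated errors
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + e

fit = sm.OLS(y, sm.add_constant(x)).fit()
r = fit.resid
dw_manual = np.sum(np.diff(r) ** 2) / np.sum(r ** 2)
print(dw_manual, durbin_watson(r))     # well below 2: positive autocorrelation
```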
Homoscedasticity Assumption
In linear regression, the homoscedasticity assumption requires that the variance of the error terms, or residuals, remains constant across all levels of the predictor variables. This is formally stated as Var(ε_i) = σ² for all i, where σ² is a positive constant independent of the values taken by the predictors x_i. The assumption is one of the core conditions of the classical linear model, ensuring that ordinary least squares (OLS) estimators achieve the best linear unbiased estimator (BLUE) properties under the Gauss-Markov theorem. Violation of homoscedasticity, known as heteroscedasticity, occurs when the error variance changes systematically with the predictors, such as increasing with higher values of x.
To detect heteroscedasticity, analysts commonly inspect residual plots, in which residuals are graphed against fitted values or predictors; a funnel-shaped pattern, with residuals spreading out as fitted values increase, signals non-constant variance. A formal statistical test is the Breusch-Pagan test, which involves regressing the squared residuals from the original model on the predictors and computing the Lagrange multiplier statistic LM = nR², where n is the sample size and R² is the coefficient of determination from this auxiliary regression; under the null hypothesis of homoscedasticity, the statistic follows a χ² distribution with degrees of freedom equal to the number of predictors.
Heteroscedasticity does not bias the OLS coefficient estimates, which remain unbiased and consistent, but it renders them inefficient by failing to minimize the variance among linear unbiased estimators. More critically, it invalidates the usual formulas for standard errors, leading to unreliable confidence intervals and t-tests; in particular, standard errors may be underestimated in regions of high variance, inflating t-statistics and increasing the risk of Type I errors.
Common remedies include weighted least squares (WLS), which minimizes a weighted sum of squared residuals using weights w_i = 1/σ_i² to give greater influence to observations with smaller error variances, thereby restoring efficiency. Alternatively, heteroscedasticity-robust standard errors, such as White's estimator, adjust the covariance matrix of the OLS coefficients to account for unknown forms of heteroscedasticity without altering the point estimates; this sandwich estimator consistently estimates the variance even under heteroscedasticity.
For instance, in a wage prediction model regressing earnings on years of education using U.S. Current Population Survey data, residual plots often reveal a fan-out pattern at higher education levels, where earnings variability increases, confirming heteroscedasticity and necessitating robust adjustments for valid inference.
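A minimal Python sketch of detection and the two remedies, using statsmodels on simulated heteroscedastic data (the variable names, the error structure, and the choice of HC3 robust errors are illustrative assumptions):
```python
# Sketch: Breusch-Pagan test, robust (HC) standard errors, and weighted least squares.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(7)
n = 300
x = rng.uniform(1, 10, size=n)
sigma = 0.5 * x                                   # error spread grows with x
y = 1.0 + 2.0 * x + rng.normal(scale=sigma)

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()

lm_stat, lm_pval, _, _ = het_breuschpagan(ols.resid, X)
print(lm_pval)                                    # small p-value: heteroscedasticity detected

robust = sm.OLS(y, X).fit(cov_type="HC3")         # White-type robust standard errors
wls = sm.WLS(y, X, weights=1.0 / sigma**2).fit()  # weights assume the variances are known
print(ols.bse, robust.bse, wls.bse)
```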
Normality Assumption
In linear regression models, the normality assumption requires that the error terms are independently and identically distributed as normal with mean zero and constant variance σ², denoted ε_i ~ N(0, σ²). This assumption underpins the validity of standard inference procedures, including t-tests for individual regression coefficients and F-tests for overall model significance, by ensuring that the sampling distributions of these test statistics follow exact t or F distributions in finite samples. While the ordinary least squares (OLS) estimators are unbiased and consistent under the weaker Gauss-Markov conditions without requiring normality, the assumption is essential for reliable hypothesis testing and confidence intervals, particularly when exact p-values are needed.
To assess adherence to this assumption, analysts examine the residuals e_i = y_i − ŷ_i, which serve as proxies for the unobserved errors. Graphical diagnostics include histograms of the residuals, to visualize their shape against a superimposed normal density curve, and quantile-quantile (Q-Q) plots, which plot the ordered residuals against theoretical quantiles of a standard normal distribution; substantial deviations from a straight line indicate non-normality, such as skewness or excess kurtosis. The Q-Q plot effectively highlights tail behavior and is widely used in regression diagnostics. Formal tests complement these visuals, with the Shapiro-Wilk test being particularly powerful for small to moderate sample sizes; it computes the W statistic as the squared correlation between the ordered residuals and the corresponding expected normal order statistics, where W close to 1 supports the null hypothesis of normality and a low p-value rejects it. This test outperforms alternatives such as the Kolmogorov-Smirnov test in detecting departures from normality in regression residuals.
Violating the normality assumption primarily affects inferential statistics rather than point estimates, as the OLS coefficients remain unbiased even under non-normal errors. In small samples, however, non-normality can inflate Type I error rates or bias p-values in t- and F-tests, leading to unreliable significance assessments and confidence intervals; for example, skewed residuals may cause standard errors to be over- or underestimated. In larger samples, the central limit theorem often restores approximate normality in the distribution of the estimators, mitigating these issues and making the assumption less stringent for asymptotic inference. Simulations confirm that while severe non-normality affects small-sample tests, moderate violations have negligible effects on coefficient estimates.
Remedies for non-normal residuals focus on restoring approximate normality or bypassing the assumption. Data transformations, such as the Box-Cox power transformation applied to the response variable y, adjust the scale to stabilize residuals toward normality; the transformation is y^(λ) = (y^λ − 1)/λ for λ ≠ 0 (and log y for λ = 0), with λ estimated via maximum likelihood. Alternative approaches include robust regression techniques, such as Huber M-estimation, which downweight outliers and are less sensitive to the shape of the error distribution, or non-parametric bootstrap methods that derive inference empirically without assuming normality. These strategies maintain the interpretability of OLS while addressing violations.
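The following Python sketch applies these checks and the Box-Cox remedy to a simulated skewed response, using statsmodels and SciPy (the data and variable names are illustrative).
```python
# Sketch: Q-Q plot, Shapiro-Wilk test, and Box-Cox transformation (simulated data).
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(8)
x = rng.uniform(0, 5, size=150)
y = np.exp(0.5 + 0.4 * x + rng.normal(scale=0.4, size=150))   # right-skewed response

fit = sm.OLS(y, sm.add_constant(x)).fit()
w_stat, p_value = stats.shapiro(fit.resid)     # low p-value: residuals look non-normal
sm.qqplot(fit.resid, line="s")                 # points should hug the reference line

y_bc, lam = stats.boxcox(y)                    # lambda chosen by maximum likelihood
fit_bc = sm.OLS(y_bc, sm.add_constant(x)).fit()
print(p_value, lam, stats.shapiro(fit_bc.resid).pvalue)
plt.show()
```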
Goodness of Fit Assessment
Coefficient of Determination (R-squared)
The coefficient of determination, denoted R², quantifies the proportion of the total variance in the response variable that is accounted for by the regression model in linear regression analysis. Introduced by the geneticist Sewall Wright in 1921 in his work on correlation and causation,[12] it serves as a key goodness-of-fit metric for assessing how well the model captures the underlying patterns in the data. The formula is
R² = 1 − SS_res / SS_tot,
where SS_res = Σ_i (y_i − ŷ_i)² is the sum of squared residuals between the observed values y_i and the predicted values ŷ_i, and SS_tot = Σ_i (y_i − ȳ)² is the total sum of squares measuring variation around the mean ȳ. This expression arises from partitioning the total sum of squares into explained (regression) and unexplained (residual) components, so that R² equals the ratio of the regression sum of squares to the total sum of squares.[13]
In interpretation, R² ranges from 0 to 1, with 0 indicating that the model explains no variance (equivalent to using the mean as the predictor) and 1 signifying a perfect fit in which all variance is explained. An R² closer to 1 suggests stronger explanatory power, but R² invariably increases, or at least does not decrease, when additional predictors are included, even if they add little explanatory value.[14] R² also relates to the overall F-statistic for model significance, since for a given sample size and number of predictors a higher R² yields a larger F-value.[15]
Despite its utility, R² has notable limitations: it does not establish causation, as high values can occur in models with spurious correlations; it can be inflated in misspecified models that fail to capture nonlinearity or other violations; and it provides no penalty for overfitting, leading to overly optimistic assessments in complex models with many predictors. These issues highlight the need for complementary diagnostics beyond R² alone.[14]
For example, in a linear regression model estimating housing prices from predictors such as square footage and number of bedrooms, an R² of 0.75 indicates that 75% of the variation in prices is explained by these features, leaving 25% attributable to other, unmodeled factors.[16] The partitioning underlying R², in which total variance decomposes into explained and residual portions, underpins extensions such as adjusted R², which penalizes additional predictors to better gauge model parsimony.
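A minimal Python sketch computing R² from this sum-of-squares definition and checking it against statsmodels (simulated data; all names are illustrative):
```python
# Sketch: R-squared from its sum-of-squares definition.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(x)).fit()
ss_res = np.sum((y - fit.fittedvalues) ** 2)   # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)           # total sum of squares
r_squared = 1.0 - ss_res / ss_tot
print(r_squared, fit.rsquared)                 # the two values agree
```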
Adjusted R-squared and Related Metrics
The adjusted R-squared (R²_adj) is a modified version of the coefficient of determination that accounts for the number of predictors in a regression model, providing a more reliable measure of goodness of fit when comparing models of varying complexity.[17] Its formula is
R²_adj = 1 − (1 − R²)(n − 1) / (n − k − 1),
where R² is the unadjusted coefficient of determination, n is the sample size, and k is the number of predictors.[17] This adjustment penalizes the inclusion of irrelevant variables by incorporating the degrees of freedom, so that R²_adj increases only if a new predictor improves the model's explanatory power by more than would be expected by chance; otherwise it decreases or remains unchanged, promoting parsimonious models.[17]
Related metrics for model selection and validation include Mallows's Cp and the Akaike information criterion (AIC), both of which balance model fit against complexity in multiple regression settings. Mallows's Cp, introduced by Colin L. Mallows, is calculated as
Cp = RSS_p / s² − n + 2p,
where RSS_p is the residual sum of squares for the subset model with p parameters, s² is an unbiased estimate of the error variance from the full model, and n is the sample size; models with Cp values close to p indicate good predictive performance without excessive bias or variance.[18] The AIC, proposed by Hirotugu Akaike, estimates the relative quality of models for prediction and is given by
AIC = 2k − 2 ln L,
where L is the maximized likelihood of the model and k is the number of estimated parameters; lower AIC values favor models that achieve adequate fit with fewer parameters, helping to avoid overfitting.[19]
In practice, adjusted R-squared is preferred over the unadjusted R² when comparing models, since R²_adj ≤ R² and it rises only when added predictors explain variance efficiently after the complexity penalty. For instance, in a multiple regression with n = 100 observations, k = 10 predictors, and R² = 0.80, the adjusted value is approximately 0.78, signaling that some predictors may not justify their inclusion.[17]
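As a sketch, the adjusted R² and AIC formulas above can be evaluated by hand and compared with statsmodels output on simulated data with mostly irrelevant predictors (all names and sizes are illustrative):
```python
# Sketch: adjusted R-squared and AIC from their formulas vs statsmodels output.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n, k = 100, 10
X = rng.normal(size=(n, k))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)   # only the first predictor matters

fit = sm.OLS(y, sm.add_constant(X)).fit()
r2 = fit.rsquared
r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
aic = 2 * (k + 1) - 2 * fit.llf                # 2k' - 2 ln L with k' = k + 1 parameters
print(r2, r2_adj, fit.rsquared_adj)            # manual and statsmodels values agree
print(aic, fit.aic)
```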
Overall Model Significance Tests
The overall significance of a linear regression model is assessed with the F-test for overall fit, which determines whether at least one predictor variable contributes significantly to explaining the variation in the response, beyond a model consisting solely of the mean response. The test compares the fit of the full regression model to the intercept-only null model under the assumption of normally distributed errors with constant variance.[20] The null hypothesis states that all slope coefficients are zero, i.e., β_j = 0 for j = 1, …, k, where k is the number of predictors; the intercept is not included in this hypothesis, as it represents the mean response under the null.[15] The alternative hypothesis is that at least one β_j ≠ 0. The test statistic is
F = (SSR / k) / (SSE / (n − k − 1)),
where SSR is the sum of squares due to regression, SSE is the residual sum of squares, and n is the number of observations; under the null hypothesis the statistic follows an F-distribution with k and n − k − 1 degrees of freedom.[20] A p-value below a chosen significance level (e.g., 0.05) rejects the null, indicating that the model as a whole explains a statistically significant portion of the variance in the response variable.[15]
The F-statistic is related to the coefficient of determination through
F = (R² / k) / ((1 − R²) / (n − k − 1)),
allowing the test to evaluate the statistical reliability of R² as a measure of the model's explanatory power.[21] For instance, in a multiple regression model predicting sales from 5 predictors, an F-statistic of 15.2 with a p-value below 0.001 would reject the null hypothesis, confirming the model's overall significance.[21]
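A minimal Python sketch recovering the overall F-statistic from R² and comparing it with the value statsmodels reports (simulated data; names are illustrative):
```python
# Sketch: overall F-statistic recovered from R-squared.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n, k = 120, 5
X = rng.normal(size=(n, k))
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
r2 = fit.rsquared
f_from_r2 = (r2 / k) / ((1.0 - r2) / (n - k - 1))
print(f_from_r2, fit.fvalue, fit.f_pvalue)     # the two F values agree
```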
Residual Diagnostics
Visual Inspection of Residuals
Visual inspection of residuals is a fundamental diagnostic technique in regression analysis, allowing analysts to graphically identify patterns that may indicate model misspecification or violations of underlying assumptions. By plotting residuals (the differences between observed and predicted values) against fitted values, predictors, or theoretical distributions, potential issues such as nonlinearity, heteroscedasticity, or non-normality become apparent through non-random patterns. This approach provides an intuitive, preliminary assessment before formal statistical tests, enabling model refinement.[22]
One of the primary plots is residuals versus fitted values, which scatters residuals on the y-axis against predicted values on the x-axis to check for linearity and constant variance. An ideal plot shows a random scatter of points around the horizontal line at zero, with no discernible trends or patterns; a curved shape suggests nonlinearity in the relationship, while a funnel-like spread indicates heteroscedasticity, where residual variance changes with the fitted values.[23][24] Similarly, plots of residuals versus each predictor show residuals against individual independent variables to detect nonlinearity specific to those predictors; random scatter is desirable, but systematic curves or clusters signal the need for transformations or additional terms such as polynomials.[25]
To assess normality of residuals, the quantile-quantile (Q-Q) plot compares the ordered standardized residuals against theoretical quantiles from a normal distribution, with points ideally aligning along a straight diagonal line. Deviations at the tails suggest heavy- or light-tailed distributions, while S-shaped curves indicate skewness.[26] For a focused check on heteroscedasticity, the scale-location plot graphs the square root of the absolute standardized residuals against fitted values; a horizontal trend with random scatter around it confirms constant variance, whereas an upward or downward trend reveals increasing or decreasing spread.[27] In all cases, the absence of patterns affirms model adequacy, while detected issues guide adjustments such as variable transformations or alternative model forms.[28]
These diagnostic plots are readily generated in statistical software. In R, the base function plot(lm_object) automatically produces a suite of residual plots, including residuals vs. fitted, Q-Q, scale-location, and residuals vs. leverage, facilitating quick inspection.[29] In Python, the statsmodels library offers functions such as plot_regress_exog for residuals versus predictors and built-in plotting methods for fitted values and Q-Q plots to visualize diagnostics.[30]
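A short Python sketch along these lines, assuming a simulated data set with illustrative variable names, uses statsmodels' plotting helpers for the predictor and Q-Q plots and builds a scale-location style plot by hand:
```python
# Sketch: residual diagnostics with statsmodels plotting helpers (simulated data).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(12)
df = pd.DataFrame({"x1": rng.normal(size=150), "x2": rng.normal(size=150)})
df["y"] = 1.0 + 2.0 * df["x1"] - 1.0 * df["x2"] + rng.normal(size=150)

results = smf.ols("y ~ x1 + x2", data=df).fit()

sm.graphics.plot_regress_exog(results, "x1")        # residual and partial plots vs x1
sm.qqplot(results.resid, line="s")                  # normal Q-Q plot of the residuals

# Scale-location style check: sqrt(|standardized residuals|) against fitted values.
infl = results.get_influence()
plt.figure()
plt.scatter(results.fittedvalues, np.sqrt(np.abs(infl.resid_studentized_internal)))
plt.xlabel("Fitted values")
plt.ylabel("sqrt(|standardized residuals|)")
plt.show()
```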
For instance, in longitudinal data analysis, plotting residuals against time can uncover temporal trends or autocorrelation; a desirable random scatter supports independence, but upward or downward drifts indicate unmodeled time dependencies, prompting inclusion of time-based covariates or mixed-effects models.[31] Overall, these visual tools verify core regression assumptions by highlighting deviations in an accessible manner.[32]
