Regression validation
from Wikipedia

In statistics, regression validation is the process of deciding whether the numerical results quantifying hypothesized relationships between variables, obtained from regression analysis, are acceptable as descriptions of the data. The validation process can involve analyzing the goodness of fit of the regression, analyzing whether the regression residuals are random, and checking whether the model's predictive performance deteriorates substantially when applied to data that were not used in model estimation.

Goodness of fit


One measure of goodness of fit is the coefficient of determination, often denoted R2. In ordinary least squares with an intercept, it ranges between 0 and 1. However, an R2 close to 1 does not guarantee that the model fits the data well. For example, if the functional form of the model does not match the data, R2 can be high despite a poor model fit. Anscombe's quartet consists of four example data sets with similarly high R2 values, but whose data sometimes clearly do not fit the regression line. Instead, the data sets include outliers, high-leverage points, or non-linearities.

One problem with the R2 as a measure of model validity is that it can always be increased by adding more variables into the model, except in the unlikely event that the additional variables are exactly uncorrelated with the dependent variable in the data sample being used. This problem can be avoided by doing an F-test of the statistical significance of the increase in the R2, or by instead using the adjusted R2.
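A minimal Python sketch of these two checks, using statsmodels on simulated data (the variables, seed, and numbers are illustrative assumptions, not part of the original article): adding a pure-noise predictor raises R2 but not the adjusted R2, and a partial F-test judges whether the increase in R2 is statistically significant.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                      # pure noise, unrelated to y
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

fit_small = sm.OLS(y, sm.add_constant(x1)).fit()
fit_big = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# R-squared never decreases when x2 is added, but adjusted R-squared can.
print(fit_small.rsquared, fit_big.rsquared)
print(fit_small.rsquared_adj, fit_big.rsquared_adj)

# F-test of the restricted (small) model against the larger one,
# i.e. a test of whether the increase in R-squared is significant.
f_value, p_value, df_diff = fit_big.compare_f_test(fit_small)
print(f"F = {f_value:.2f}, p = {p_value:.3f}")
```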

Analysis of residuals


The residuals from a fitted model are the differences between the responses observed at each combination of values of the explanatory variables and the corresponding prediction of the response computed using the regression function. Mathematically, the definition of the residual for the ith observation in the data set is written

e_i = y_i - f(x_i; \hat{\beta}),

with y_i denoting the ith response in the data set, x_i the vector of explanatory variables (each set at the corresponding values found in the ith observation), and \hat{\beta} the vector of estimated parameters.

[Figure: a fit to data (green curve in top panel, data in red) together with a plot of the residuals (red points in bottom panel). The dashed curve in the bottom panel is a straight-line fit to the residuals; if the functional form is correct, the residuals should show little or no trend, as seen here.]
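A short numpy sketch of this definition on simulated data (all names and values here are assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 3.0 + 0.5 * x + rng.normal(scale=0.8, size=x.size)

# Fit a straight line by least squares and form the residuals e_i = y_i - f(x_i; beta-hat).
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
residuals = y - fitted

print(residuals[:5])
print(residuals.mean())   # essentially zero for an OLS fit with an intercept
```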

If the model fit to the data were correct, the residuals would approximate the random errors that make the relationship between the explanatory variables and the response variable a statistical relationship. Therefore, if the residuals appear to behave randomly, it suggests that the model fits the data well. On the other hand, if non-random structure is evident in the residuals, it is a clear sign that the model fits the data poorly. The next section details the types of plots to use to test different aspects of a model and gives the correct interpretations of different results that could be observed for each type of plot.

Graphical analysis of residuals


A basic, though not quantitatively precise, way to check for problems that render a model inadequate is to conduct a visual examination of the residuals (the mispredictions of the data used in quantifying the model) to look for obvious deviations from randomness. If a visual examination suggests, for example, the possible presence of heteroscedasticity (a relationship between the variance of the model errors and the size of an independent variable's observations), then statistical tests can be performed to confirm or reject this hunch; if it is confirmed, different modeling procedures are called for.

Different types of plots of the residuals from a fitted model provide information on the adequacy of different aspects of the model.

  1. sufficiency of the functional part of the model: scatter plots of residuals versus predictors
  2. non-constant variation across the data: scatter plots of residuals versus predictors; for data collected over time, also plots of residuals against time
  3. drift in the errors (data collected over time): run charts of the response and errors versus time
  4. independence of errors: lag plot
  5. normality of errors: histogram and normal probability plot

Graphical methods have an advantage over numerical methods for model validation because they readily illustrate a broad range of complex aspects of the relationship between the model and the data.
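The following Python sketch produces several of the plot types listed above for a simple simulated dataset (the data, and the choice to treat the observation index as time, are assumptions made purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 100)                       # also treated as the time order
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=x.size)

slope, intercept = np.polyfit(x, y, deg=1)
resid = y - (intercept + slope * x)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].scatter(x, resid); axes[0, 0].axhline(0.0)
axes[0, 0].set_title("Residuals vs predictor")    # functional form, constant variance
axes[0, 1].plot(resid, marker="o")
axes[0, 1].set_title("Run chart of residuals")    # drift over time
axes[1, 0].scatter(resid[:-1], resid[1:])
axes[1, 0].set_title("Lag plot")                  # independence of errors
axes[1, 1].hist(resid, bins=15)
axes[1, 1].set_title("Histogram of residuals")    # normality
plt.tight_layout()
plt.show()
```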

Quantitative analysis of residuals


Numerical methods also play an important role in model validation. For example, the lack-of-fit test for assessing the correctness of the functional part of the model can aid in interpreting a borderline residual plot. One common situation when numerical validation methods take precedence over graphical methods is when the number of parameters being estimated is relatively close to the size of the data set. In this situation residual plots are often difficult to interpret due to constraints on the residuals imposed by the estimation of the unknown parameters. One area in which this typically happens is in optimization applications using designed experiments. Logistic regression with binary data is another area in which graphical residual analysis can be difficult.

Serial correlation of the residuals can indicate model misspecification, and can be checked for with the Durbin–Watson statistic. The problem of heteroskedasticity can be checked for in any of several ways.
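A small sketch of the Durbin–Watson check with statsmodels, on data simulated with deliberately autocorrelated (AR(1)) errors; everything here is an illustrative assumption:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
n = 120
t = np.arange(n, dtype=float)

# Simulate AR(1) errors so the regression residuals are serially correlated.
e = np.zeros(n)
for i in range(1, n):
    e[i] = 0.7 * e[i - 1] + rng.normal()
y = 1.0 + 0.05 * t + e

res = sm.OLS(y, sm.add_constant(t)).fit()
print(f"Durbin-Watson: {durbin_watson(res.resid):.2f}")  # well below 2 here: positive serial correlation
```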

Out-of-sample evaluation


Cross-validation is the process of assessing how the results of a statistical analysis will generalize to an independent data set. If the model has been estimated over some, but not all, of the available data, then the model using the estimated parameters can be used to predict the held-back data. If, for example, the out-of-sample mean squared error, also known as the mean squared prediction error, is substantially higher than the in-sample mean square error, this is a sign of deficiency in the model.
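A rough holdout sketch of this comparison in Python (simulated data; the 70/30 split and all names are assumptions for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

# Hold back roughly 30% of the observations for out-of-sample evaluation.
idx = rng.permutation(n)
train, test = idx[:210], idx[210:]

res = sm.OLS(y[train], sm.add_constant(X[train])).fit()

in_sample_mse = np.mean(res.resid ** 2)
pred = res.predict(sm.add_constant(X[test]))
out_of_sample_mse = np.mean((y[test] - pred) ** 2)  # mean squared prediction error

# A much larger out-of-sample MSE than in-sample MSE signals a deficient model.
print(f"in-sample MSE: {in_sample_mse:.3f}")
print(f"out-of-sample MSE: {out_of_sample_mse:.3f}")
```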

A development in medical statistics is the use of out-of-sample cross validation techniques in meta-analysis. It forms the basis of the validation statistic, Vn, which is used to test the statistical validity of meta-analysis summary estimates. Essentially it measures a type of normalized prediction error and its distribution is a linear combination of χ2 variables of degree 1. [1]

from Grokipedia
Regression validation is the process of evaluating the adequacy, reliability, and generalizability of a regression model to ensure it accurately represents the underlying relationship between predictor and response variables, involving checks on model assumptions, fit, and predictive performance. In statistical modeling, this step confirms that the model is not only statistically significant but also practically useful, preventing problems such as overfitting or violations of assumptions such as linearity, independence, homoscedasticity, and normality of residuals.

Key techniques in regression validation include graphical residual analysis, which examines plots of residuals (observed minus predicted values) to detect patterns indicating poor fit, such as non-random scatter or outliers; random residuals suggest the model captures the data's structure adequately. Numerical methods, like the lack-of-fit test, complement these by formally testing the adequacy of the model's functional form through a comparison of residuals with pure error from replicates, which is particularly useful when replicate observations are available and residual plots are ambiguous. Cross-validation methods, such as k-fold cross-validation (where the data are split into k subsets, the model is trained on k-1 of them and validated on the held-out portion, and the errors are averaged) and leave-one-out cross-validation (LOOCV), which iteratively leaves out single observations, provide robust estimates of predictive accuracy by simulating performance on unseen data.

Additional aspects involve assessing model stability through data splitting or resampling to verify reliability and generalizability, with sample sizes calculated to achieve sufficient power (for example, in validating a regression model for fetal weight estimation, at least 173 observations are required for 80% power at α = 0.05 using the exact method under the model's parameters). These techniques collectively ensure that the regression model's coefficients and predictions align with theoretical expectations and perform well beyond the training dataset, making validation essential for applied work across many fields.

Core Assumptions in Regression

Linearity Assumption

In linear regression models, the linearity assumption requires that the expected value of the response variable is a linear function of the predictor variables. This principle underlies both simple linear regression, modeled as E(Y_i) = \beta_0 + \beta_1 x_i, and multiple linear regression, expressed as E(Y_i) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}. The assumption encompasses additivity, where the effects of predictors on the response are independent and combine linearly without interactions or curvature influencing the slope.

To check the assumption, diagnostic plots are commonly used. Scatter plots of the observed response against each predictor provide an initial visual assessment of linear trends. Residuals versus fitted values or versus individual predictors should exhibit a random scatter around zero with no discernible patterns, such as bows or curves, which would signal nonlinearity. In multiple regression settings, component-plus-residual plots (or partial residual plots) help evaluate the linear contribution of each predictor after accounting for the others.

Violation of the linearity assumption leads to biased estimates, diminished predictive accuracy, and compromised inference, including unreliable p-values for significance tests. Nonlinear patterns can cause systematic errors in predictions, especially when extrapolating beyond the range of observed data.

Remedies for addressing nonlinearity include augmenting the model with polynomial terms, such as a quadratic component \beta_2 x_i^2, to model curvature. Transformations of variables, including logarithmic (e.g., \log(Y)) or square root functions, can often restore linearity by stabilizing exponential or skewed relationships. For more pronounced nonlinearity, adopting nonlinear regression techniques may be required instead of forcing a linear fit.

As an example, consider the simple linear model y_i = \beta_0 + \beta_1 x_i + \epsilon_i. Linearity is evaluated by plotting residuals against fitted values; a pattern-free cloud of points centered on the zero line affirms the assumption, while any systematic curvature suggests refitting with polynomials or transformations.
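A brief sketch of this check and the quadratic-term remedy, using statsmodels on data simulated with a genuinely curved relationship (all specifics are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(scale=0.5, size=x.size)  # truly quadratic

# A straight-line fit: its residuals retain the curvature the model misses.
lin = sm.OLS(y, sm.add_constant(x)).fit()
print(np.corrcoef(lin.resid, x**2)[0, 1])   # strong correlation reveals the missed term

# Remedy: augment the design with a quadratic component beta_2 * x^2.
quad = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()
print(lin.rsquared, quad.rsquared)
```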

Independence Assumption

In linear regression models, the independence assumption requires that the error terms \epsilon_i for different observations are uncorrelated, formally expressed as E(\epsilon_i \epsilon_j) = 0 for all i \neq j. This assumption ensures that the residuals do not exhibit systematic patterns of dependence, allowing the ordinary least squares (OLS) estimator to produce unbiased and efficient inferences under the Gauss-Markov theorem.

Violations of this assumption arise from various data structures, including autocorrelation, where errors in sequential observations are positively or negatively correlated due to temporal trends; spatial dependence, in which nearby geographic units influence each other, leading to correlated residuals; and clustered sampling, such as in multi-center studies where observations within the same group (e.g., hospitals or schools) share unmodeled similarities quantified by an intraclass correlation coefficient (ICC) greater than zero.

A primary method for detecting violations, particularly first-order autocorrelation in time-ordered data, is the Durbin-Watson test, introduced by Durbin and Watson. The test statistic is calculated as

DW = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2},

where e_t are the OLS residuals and n is the number of observations. The DW statistic ranges from 0 to 4, with a value near 2 indicating no first-order autocorrelation; values below 2 suggest positive autocorrelation (errors tend to persist), while values above 2 indicate negative autocorrelation (errors alternate in sign). Critical values d_L and d_U from significance tables are used for hypothesis testing: if DW < d_L, reject the null hypothesis of no autocorrelation; if DW > d_U, fail to reject; otherwise, the result is inconclusive.

Violating the independence assumption leads to underestimated standard errors of the coefficient estimates, as the model fails to account for the reduced effective sample size due to dependence. This underestimation inflates t-statistics, resulting in inflated Type I error rates for significance tests (potentially up to 30% or higher at moderate ICC levels like 0.3) and poor coverage of confidence intervals (e.g., dropping to 71% at ICC = 0.5). Overall, these issues render significance tests unreliable and bias inferences about predictor effects.

Remedies for dependence include generalized least squares (GLS), which transforms the model to account for the correlation structure in the errors, yielding efficient estimates. For autocorrelation, adding lagged dependent or independent variables can model the serial dependence explicitly, as in autoregressive (AR) specifications. In cases of clustering or hierarchical data, mixed-effects models incorporate random effects to capture group-level variation, restoring valid inference.
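The DW statistic can also be computed directly from its definition; the sketch below (illustrative data only) contrasts an uncorrelated series with a strongly persistent one:

```python
import numpy as np

def durbin_watson_stat(e):
    """DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2 for a residual series e."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(6)
white_noise = rng.normal(size=200)
persistent = np.cumsum(rng.normal(size=200))    # strongly positively autocorrelated

print(durbin_watson_stat(white_noise))   # close to 2: no first-order autocorrelation
print(durbin_watson_stat(persistent))    # far below 2: positive autocorrelation
```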

Homoscedasticity Assumption

In linear regression, the homoscedasticity assumption requires that the variance of the error terms, or residuals, remains constant across all levels of the predictor variables. This is formally stated as \operatorname{Var}(\epsilon_i \mid X_i) = \sigma^2, where \sigma^2 is a positive constant independent of the values taken by the predictors X_i. This assumption is one of the core conditions of the classical linear regression model, ensuring that ordinary least squares (OLS) estimators achieve the best linear unbiased estimator (BLUE) properties under the Gauss-Markov theorem. Violation of homoscedasticity, known as heteroscedasticity, occurs when the error variance changes systematically with the predictors, such as increasing with higher values of X.

To detect heteroscedasticity, analysts commonly inspect residual plots, where residuals are graphed against fitted values or predictors; a funnel-shaped pattern, with residuals spreading out as fitted values increase, signals non-constant variance. A formal statistical test is the Breusch-Pagan test, which involves regressing the squared residuals from the original model on the predictors and computing the statistic n R^2, where n is the sample size and R^2 is the coefficient of determination from this auxiliary regression; under the null hypothesis of homoscedasticity, this statistic follows a \chi^2 distribution with degrees of freedom equal to the number of predictors.

Heteroscedasticity does not bias OLS coefficient estimates, which remain unbiased and consistent, but it renders them inefficient by failing to minimize the variance among linear unbiased estimators. More critically, it invalidates the usual formulas for standard errors, leading to unreliable confidence intervals and t-tests for significance; specifically, standard errors may be underestimated in regions of high variance, inflating t-statistics and increasing the rate of Type I errors.

Common remedies include weighted least squares (WLS), which minimizes a weighted sum of squared residuals using weights w_i = 1 / \operatorname{Var}(\epsilon_i) to give greater influence to observations with smaller error variances, thereby restoring efficiency. Alternatively, heteroscedasticity-robust standard errors, such as White's estimator, adjust the covariance matrix of the OLS coefficients to account for unknown forms of heteroscedasticity without altering the point estimates; this sandwich estimator consistently estimates the variance even under heteroscedasticity. For instance, in a prediction model regressing earnings on years of education using U.S. data, residual plots often reveal a widening spread at higher education levels, where earnings variability increases, confirming heteroscedasticity and necessitating robust adjustments for valid inference.
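A compact sketch of the Breusch-Pagan test and robust standard errors with statsmodels, on data simulated so that the error variance grows with the predictor (the names, seed, and the HC3 choice are assumptions for illustration):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(7)
n = 400
x = rng.uniform(1, 10, size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5 * x, size=n)   # error spread grows with x

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

# Breusch-Pagan: LM statistic from regressing squared residuals on the predictors.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, X)
print(f"Breusch-Pagan LM = {lm_stat:.1f}, p = {lm_pvalue:.4f}")

# Heteroscedasticity-robust (sandwich) standard errors: the coefficients are
# unchanged, only the inference is corrected.
robust = res.get_robustcov_results(cov_type="HC3")
print(res.bse)      # conventional standard errors
print(robust.bse)   # robust standard errors
```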

Normality Assumption

In linear regression models, the normality assumption requires that the error terms \epsilon_i are independently and identically distributed as normal with zero mean and constant variance \sigma^2, denoted \epsilon_i \sim N(0, \sigma^2). This assumption underpins the validity of standard inference procedures, including t-tests for individual regression coefficients and F-tests for overall model significance, by ensuring that the sampling distributions of these test statistics follow their nominal t and F distributions in finite samples. While the ordinary least squares (OLS) estimators are unbiased and consistent under the weaker Gauss-Markov conditions without requiring normality, the assumption is essential for reliable testing and confidence intervals, particularly when deriving p-values.

To assess adherence to this assumption, analysts examine the residuals e_i = y_i - \hat{y}_i, which serve as proxies for the unobserved errors. Graphical diagnostics include histograms of the residuals to visualize their shape against a superimposed normal density curve, and quantile-quantile (Q-Q) plots, which plot ordered residuals against theoretical quantiles from a standard normal distribution; substantial deviations from a straight line indicate non-normality, such as skewness or heavy tails. The Q-Q plot effectively highlights tail behavior and is widely used in regression diagnostics. Formal tests complement these visuals, with the Shapiro-Wilk test being particularly powerful for small to moderate sample sizes; it computes the W statistic as the squared correlation between the ordered residuals and the corresponding expected normal order statistics, where a W close to 1 supports the null hypothesis of normality and a low p-value rejects it. This test outperforms alternatives like the Kolmogorov-Smirnov test in detecting departures from normality for regression residuals.

Violating the normality assumption primarily affects inferential statistics rather than point estimates, as OLS coefficients remain unbiased even under non-normal errors. However, in small samples, non-normality can inflate Type I error rates or bias p-values in t- and F-tests, leading to unreliable significance assessments and confidence intervals; for example, skewed residuals may overestimate or underestimate standard errors. In larger samples, the central limit theorem often restores approximate normality in the distribution of the estimators, mitigating these issues and making the assumption less stringent for asymptotic inference. Simulations confirm that while severe non-normality affects small-sample tests, moderate violations have negligible effects on estimates.

Remedies for non-normal residuals focus on restoring approximate normality or bypassing the assumption. Data transformations, such as the Box-Cox power transformation applied to the response variable y, adjust the scale to stabilize residuals toward normality; the transformation is y^{(\lambda)} = \frac{y^\lambda - 1}{\lambda} for \lambda \neq 0 (and \log y for \lambda = 0), with \lambda estimated via maximum likelihood to minimize residual variance. Alternative approaches include robust regression techniques, like Huber M-estimation, which downweight outliers and are less sensitive to the error distribution's shape, or non-parametric bootstrap methods to derive confidence intervals empirically without assuming normality. These strategies maintain the interpretability of OLS while addressing violations.
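A short sketch of the Shapiro-Wilk check, a Q-Q plot, and a Box-Cox remedy, using scipy and statsmodels on data simulated with a skewed (lognormal-type) response; all specifics are illustrative assumptions:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(8)
x = rng.uniform(0, 5, size=150)
y = np.exp(0.5 + 0.3 * x + rng.normal(scale=0.4, size=x.size))   # right-skewed response

res = sm.OLS(y, sm.add_constant(x)).fit()

# Shapiro-Wilk on the residuals: a small p-value rejects normality.
w_stat, p_value = stats.shapiro(res.resid)
print(f"W = {w_stat:.3f}, p = {p_value:.4f}")

# Remedy: Box-Cox transformation of the response, lambda estimated by maximum likelihood.
y_bc, lam = stats.boxcox(y)
res_bc = sm.OLS(y_bc, sm.add_constant(x)).fit()
print(f"lambda = {lam:.2f}")
print(stats.shapiro(res_bc.resid))

# Q-Q plot of the transformed-model residuals against normal quantiles.
fig = sm.qqplot(res_bc.resid, line="45", fit=True)
```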

Goodness of Fit Assessment

Coefficient of Determination (R-squared)

The coefficient of determination, denoted R^2, quantifies the proportion of the total variance in the response variable that is accounted for by the regression model in linear regression analysis. Introduced by the geneticist Sewall Wright in 1921 in his work on correlation and causation, it serves as a key goodness-of-fit metric for assessing how well the model captures the underlying patterns in the data. The formula is

R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}},

where SS_{\text{res}} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 is the sum of squared residuals between observed values y_i and predicted values \hat{y}_i, and SS_{\text{tot}} = \sum_{i=1}^n (y_i - \bar{y})^2 is the total sum of squares measuring variation about the mean \bar{y}. This expression arises from partitioning the total sum of squares into explained (regression) and unexplained (residual) components, so that R^2 equals the ratio of the regression sum of squares to the total sum of squares.

In interpretation, R^2 ranges from 0 to 1, with a value of 0 indicating that the model explains no variance (equivalent to using the mean as the predictor) and 1 signifying a perfect fit where all variance is explained. An R^2 value closer to 1 suggests stronger explanatory power, but R^2 invariably increases, or at least does not decrease, when additional predictors are included, even if they add little explanatory value. R^2 also relates to the overall F-statistic for model significance, since a higher R^2 contributes to a larger F-value under the null hypothesis of no relationship.

Despite its utility, R^2 has notable limitations: it does not establish causation, as high values can occur in models with spurious correlations; it can be inflated in misspecified models that fail to capture nonlinearity or other violations; and it provides no penalty for adding predictors, leading to overly optimistic assessments in complex models. These issues highlight the need for complementary diagnostics beyond R^2 alone. For example, in a model estimating housing prices from predictors such as square footage and number of bedrooms, an R^2 = 0.75 indicates that 75% of the variation in prices is explained by these features, leaving 25% attributable to other unmodeled factors. The variance partitioning underlying R^2 underpins extensions like adjusted R^2, which penalize additional predictors to better gauge model parsimony.

The adjusted R-squared (R^2_{\text{adj}}) is a modified version of the coefficient of determination that accounts for the number of predictors in a regression model to provide a more reliable measure of goodness of fit, particularly when comparing models of varying complexity. Its formula is

R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1},

where R^2 is the unadjusted coefficient of determination, n is the sample size, and k is the number of predictors. This adjustment penalizes the inclusion of irrelevant variables by incorporating the degrees of freedom, ensuring that R^2_{\text{adj}} increases only if a new predictor substantially improves the model's explanatory power beyond what would be expected by chance; otherwise, it decreases or remains unchanged, promoting parsimonious models.

Related metrics for model selection and validation include Mallows's C_p and the Akaike information criterion (AIC), both of which balance model fit against complexity in multiple regression settings. Mallows's C_p, introduced by Colin L. Mallows, is calculated as

C_p = \frac{\text{RSS}_p}{s^2} - (n - 2p),

where \text{RSS}_p is the residual sum of squares for the subset model with p parameters, s^2 is an unbiased estimate of the error variance from the full model, and n is the sample size; models with C_p values close to p indicate good predictive performance without excessive bias or variance. The AIC, proposed by Hirotugu Akaike, provides an estimate of the relative quality of models for prediction, given by

\text{AIC} = -2 \log(L) + 2k,

where L is the maximized likelihood of the model and k is the number of parameters; lower AIC values favor models that achieve adequate fit with fewer parameters, helping to avoid overfitting. In practice, adjusted R-squared is preferred over the unadjusted R^2 when evaluating models because R^2_{\text{adj}} \leq R^2, with the adjusted value rewarding models that explain variance efficiently after penalizing complexity; for instance, in a multiple regression with n = 100 observations, k = 10 predictors, and R^2 = 0.80, the adjusted value is approximately 0.78, signaling that some predictors may not justify their inclusion.
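The adjusted R-squared formula is easy to verify numerically; the sketch below (simulated data, chosen to match the n = 100, k = 10 example above, with everything else assumed for illustration) compares a hand computation with the value statsmodels reports and also prints the model's AIC:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n, k = 100, 10
X = rng.normal(size=(n, k))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)   # only the first predictor matters

res = sm.OLS(y, sm.add_constant(X)).fit()

# Adjusted R-squared from its definition, checked against the library value.
r2 = res.rsquared
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(r2, r2_adj, res.rsquared_adj)

# AIC as reported by statsmodels (lower values favour more parsimonious fits).
print(res.aic)
```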

Overall Model Significance Tests

The overall model significance in multiple regression is assessed with the F-test for overall fit, which determines whether at least one predictor variable contributes significantly to explaining the variation in the response variable, beyond a model consisting solely of the mean response. This test compares the fit of the full regression model to the null model under the assumption of normally distributed errors with constant variance. The null hypothesis states that all slope coefficients are zero, i.e., H_0: \beta_j = 0 for j = 1, \dots, k, where k is the number of predictors; the intercept \beta_0 is not included in this hypothesis, as it represents the mean response under the null. The alternative hypothesis is that at least one \beta_j \neq 0.

The test statistic is

F = \frac{SS_{\mathrm{reg}} / k}{SS_{\mathrm{res}} / (n - k - 1)},

where SS_{\mathrm{reg}} is the sum of squares due to regression, SS_{\mathrm{res}} is the residual sum of squares, and n is the number of observations; under the null hypothesis the statistic follows an F-distribution with k and n - k - 1 degrees of freedom. A p-value below a chosen significance level (e.g., 0.05) rejects the null, indicating that the model as a whole explains a statistically significant portion of the variance in the response variable.

The F-statistic is mathematically equivalent to the coefficient of determination R^2 via the relation

F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)},

allowing the test to evaluate the statistical reliability of R^2 as a measure of model fit. For instance, in a multiple regression model with 5 predictors, an F-statistic of 15.2 with a p-value less than 0.001 would reject the null hypothesis, confirming the model's overall significance.
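The equivalence between the overall F-statistic and R^2 can be confirmed directly; a minimal sketch on simulated data with k = 5 predictors (the data and coefficients are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(10)
n, k = 150, 5
X = rng.normal(size=(n, k))
y = 0.5 + X @ np.array([1.0, -0.5, 0.0, 0.0, 0.3]) + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(X)).fit()

# Overall F-statistic from the R^2 identity, compared with the fitted model's value.
r2 = res.rsquared
f_from_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))
print(f_from_r2, res.fvalue)

# p-value from the F(k, n - k - 1) distribution under the null hypothesis.
print(stats.f.sf(f_from_r2, k, n - k - 1), res.f_pvalue)
```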

Residual Diagnostics

Visual Inspection of Residuals

Visual inspection of residuals is a fundamental diagnostic technique in regression analysis, allowing analysts to graphically identify patterns that may indicate model misspecification or violations of underlying assumptions. By plotting residuals (the differences between observed and predicted values) against fitted values, predictors, or theoretical distributions, potential issues such as nonlinearity, heteroscedasticity, or non-normality become apparent through non-random patterns. This approach provides an intuitive, preliminary assessment before formal statistical tests, enabling model refinement.

One of the primary plots is residuals versus fitted values, which scatters residuals on the y-axis against predicted values on the x-axis to check for linearity and constant variance. An ideal plot shows a random scatter of points around the horizontal line at zero, with no discernible trends or patterns; a curved shape suggests nonlinearity in the relationship, while a funnel-like spread indicates heteroscedasticity, where residual variance changes with fitted values. Similarly, plots of residuals versus each predictor scatter residuals against individual independent variables to detect nonlinearity specific to those predictors; random scatter is desirable, but systematic curves or clusters signal the need for transformations or additional terms such as polynomials.

To assess normality of residuals, the quantile-quantile (Q-Q) plot compares the ordered standardized residuals against theoretical quantiles from a normal distribution, with points ideally aligning along a straight diagonal line. Deviations at the tails suggest heavy- or light-tailed distributions, while S-shaped curves indicate skewness. For a focused check on heteroscedasticity, the scale-location plot graphs the square root of the absolute standardized residuals against fitted values; a horizontal line with random scatter around it confirms constant variance, whereas an upward or downward trend reveals increasing or decreasing spread. In all cases, the absence of patterns affirms model adequacy, while detected issues guide adjustments such as variable transformations or alternative functional forms.

These diagnostic plots are readily generated in statistical software. In R, the base function plot(lm_object) automatically produces a suite of residual plots, including residuals vs. fitted, Q-Q, scale-location, and residuals vs. leverage, facilitating quick inspection. In Python, the statsmodels library offers functions like plot_regress_exog for residuals versus predictors and built-in plotting methods for fitted values and Q-Q plots.

For instance, in longitudinal data, plotting residuals against time can uncover temporal trends or autocorrelation; a random scatter supports independence, but upward or downward drifts indicate unmodeled time dependencies, prompting inclusion of time-based covariates or mixed-effects models. Overall, these visual tools verify core regression assumptions by highlighting deviations in an accessible manner.
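A brief Python sketch of these plots with statsmodels, on a small simulated data frame (the column names, model, and seed are assumptions for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 2.0 * df["x1"] - 1.0 * df["x2"] + rng.normal(size=200)

res = smf.ols("y ~ x1 + x2", data=df).fit()

# Residuals vs fitted values: look for curvature or a funnel shape.
plt.scatter(res.fittedvalues, res.resid)
plt.axhline(0.0)
plt.xlabel("fitted values"); plt.ylabel("residuals")

# Residual and partial-regression diagnostics for one predictor.
fig = plt.figure(figsize=(10, 8))
sm.graphics.plot_regress_exog(res, "x1", fig=fig)

# Q-Q plot of the residuals against normal quantiles.
sm.qqplot(res.resid, line="45", fit=True)
plt.show()
```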

Statistical Tests on Residuals

Statistical tests on residuals provide formal, quantitative assessments of whether the residuals from a regression model satisfy key assumptions, such as normality, independence, and homoscedasticity, using p-values to determine significance. These tests complement visual diagnostics by offering objective criteria for model validation, with rejection of the null hypothesis indicating violations that may require model adjustments such as transformations or robust standard errors.

To evaluate the normality assumption, the Shapiro-Wilk test computes a statistic W that measures the agreement between the ordered residuals and the values expected under a normal distribution, where W ranges from 0 to 1 and values closer to 1 support normality; a common rule of thumb treats W > 0.9 as acceptable for small to moderate sample sizes, though formal inference relies on the associated p-value. The test is particularly powerful for samples up to about 50 observations. Another common test for normality is the Jarque-Bera test, which assesses deviations in skewness (S) and kurtosis (K) from the normal values of 0 and 3, respectively, via the statistic

JB = \frac{n}{6} \left( S^2 + \frac{(K - 3)^2}{4} \right),

distributed asymptotically as \chi^2(2) under the null of normality; a low p-value rejects normality, often signaling the need for generalized linear models in non-normal cases.

For detecting autocorrelation in residuals, particularly in time-series regressions, the Durbin-Watson test examines first-order serial correlation using the statistic

DW = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2},

which ranges from 0 to 4; values near 2 indicate no autocorrelation, values below about 1.5 suggest positive autocorrelation, and values above about 2.5 indicate negative autocorrelation, with critical values depending on the sample size and the number of predictors. The test assumes no lagged dependent variables and is inconclusive in some regions, prompting alternatives like the Breusch-Godfrey test for higher-order checks.

Heteroscedasticity, or varying residual variance, is tested using the Breusch-Pagan procedure, which regresses squared residuals on the predictors and computes a test statistic asymptotically distributed as \chi^2(k), where k is the number of predictors; a significant result rejects constant variance. The White test extends this by including squared and cross-product terms of the predictors in the auxiliary regression, yielding a more general \chi^2 test robust to unknown forms of heteroscedasticity, though it has lower power against specific patterns.

Multicollinearity among predictors can inflate the variances of coefficient estimates and is detected through variance inflation factors (VIF) for each predictor j, calculated as VIF_j = \frac{1}{1 - R_j^2}, where R_j^2 is the coefficient of determination from regressing predictor j on all the others; VIF values exceeding 10 signal problematic multicollinearity, potentially destabilizing estimates and increasing standard errors. Although not a direct test on residuals, VIF assesses predictor correlations as part of overall model diagnostics.

For instance, in a regression model of financial returns on market factors, the Breusch-Pagan test might produce a \chi^2 statistic with p = 0.03, rejecting homoscedasticity and suggesting the use of heteroscedasticity-consistent covariance estimators. These tests confirm patterns observed in residual plots, enabling rigorous model refinement.
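A short sketch applying the Jarque-Bera test and variance inflation factors with statsmodels, on simulated data in which two predictors are deliberately near-collinear (all details are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(12)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = sm.add_constant(np.column_stack([x1, x2, x3]))
y = 1.0 + x1 + 0.5 * x3 + rng.normal(size=n)

res = sm.OLS(y, X).fit()

# Jarque-Bera on the residuals: a small p-value rejects normality.
jb_stat, jb_pvalue, skew, kurt = jarque_bera(res.resid)
print(f"JB = {jb_stat:.2f}, p = {jb_pvalue:.3f}")

# Variance inflation factors (column 0 of X is the constant, so start at 1).
for j in range(1, X.shape[1]):
    print(f"VIF for predictor {j}: {variance_inflation_factor(X, j):.1f}")
```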

Predictive Validation Techniques

In-Sample Evaluation

In-sample evaluation in regression involves assessing model performance using the same dataset that was used to fit the model, providing initial insights into fit quality but often yielding overly favorable results because there is no separation between training and testing. Common metrics include the coefficient of determination (R^2), which quantifies the proportion of variance in the response variable explained by the model on the training data, and the mean squared error (MSE), calculated as the average of the squared residuals between observed and fitted values. These measures build on goodness-of-fit assessments by offering straightforward summaries of in-sample accuracy.

A specialized method for in-sample prediction assessment is the Predicted Residual Sum of Squares (PRESS), introduced by Allen as a criterion for variable selection and model assessment. PRESS is defined as

\text{PRESS} = \sum_{i=1}^n e_i^2,

where e_i = y_i - \hat{y}_{-i} is the predicted residual for the ith observation and \hat{y}_{-i} is the fitted value obtained by refitting the model with that observation excluded (a leave-one-out scheme). This statistic approximates the model's predictive error without requiring a separate data partition, making it suitable for smaller datasets, though the computation can be intensive for large samples.

In-sample R^2 and MSE serve as quick diagnostics for comparing multiple candidate models during fitting, allowing practitioners to identify candidates with strong apparent fit on the available data before more rigorous testing. For instance, higher R^2 values or lower MSE indicate better relative fit among options, facilitating efficient model refinement. However, these metrics are prone to optimism, as the model is tuned directly to the training data, potentially capturing idiosyncratic noise and producing inflated estimates of true predictive accuracy.

The primary limitation of in-sample evaluation lies in its inability to reliably predict generalization; a model exhibiting excellent in-sample fit, such as an R^2 close to 1, may fail dramatically on new data due to overfitting. This disconnect underscores the need for complementary validation techniques, as in-sample success alone cannot verify robustness beyond the training set. As an illustrative example, consider a model fitted to a dataset of 100 observations where the in-sample MSE equals 10; this low value suggests good alignment with the training data, but it provides no assurance against poorer performance elsewhere without additional checks.
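For ordinary least squares, the leave-one-out prediction residual has a closed form, e_i / (1 - h_ii), where h_ii is the leverage, so PRESS does not require n separate refits; the sketch below (simulated data, illustrative only) checks the shortcut against a brute-force leave-one-out loop:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(13)
n = 80
X = sm.add_constant(rng.normal(size=(n, 3)))
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

res = sm.OLS(y, X).fit()

# PRESS via the hat-matrix shortcut: predicted residual = e_i / (1 - h_ii).
h = res.get_influence().hat_matrix_diag
press = np.sum((res.resid / (1 - h)) ** 2)

# Brute-force check: refit n times, each time leaving one observation out.
press_loo = 0.0
for i in range(n):
    mask = np.arange(n) != i
    fit_i = sm.OLS(y[mask], X[mask]).fit()
    press_loo += (y[i] - fit_i.predict(X[i:i + 1])[0]) ** 2

print(press, press_loo)   # the two agree up to floating-point error
```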

Out-of-Sample and Cross-Validation Methods

Out-of-sample validation techniques assess a regression model's predictive performance on data not used during fitting, providing a more reliable estimate of generalization than in-sample metrics by mitigating overfitting. One straightforward approach is the train-test split, where the dataset is randomly divided into a training subset (typically 70-80% of the observations) used to fit the model and a held-out test subset (20-30%) reserved for evaluation. Performance is then measured on the test set using metrics such as out-of-sample R^2 or root mean squared error (RMSE), which quantify how well the model predicts unseen observations. This method is simple and computationally efficient but can be sensitive to the specific split, potentially leading to high variance in the estimates if the dataset is small.

To address the variability of a single split, k-fold cross-validation partitions the data into k equally sized folds, trains the model k times (each time using k-1 folds for training and the remaining fold for validation), and then averages the performance across all folds. The cross-validation RMSE, a common metric for regression, is given by

\text{CV RMSE} = \sqrt{\frac{1}{k} \sum_{j=1}^{k} \text{MSE}_j},

where \text{MSE}_j is the mean squared error on the jth held-out fold.
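A minimal k-fold sketch of this computation with numpy and statsmodels (k = 5 and the simulated data are assumptions for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(14)
n, k_folds = 200, 5
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

# Shuffle once, then split the indices into k roughly equal folds.
folds = np.array_split(rng.permutation(n), k_folds)

mse_per_fold = []
for held_out in folds:
    train = np.setdiff1d(np.arange(n), held_out)
    fit = sm.OLS(y[train], sm.add_constant(X[train])).fit()
    pred = fit.predict(sm.add_constant(X[held_out]))
    mse_per_fold.append(np.mean((y[held_out] - pred) ** 2))

cv_rmse = np.sqrt(np.mean(mse_per_fold))
print(f"{k_folds}-fold CV RMSE: {cv_rmse:.3f}")
```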