Statistical model validation
from Wikipedia

In statistics, model validation is the task of evaluating whether a chosen statistical model is appropriate. Often in statistical inference, models that appear to fit their data may do so by chance, leading researchers to misjudge the actual relevance of their model. To combat this, model validation is used to test whether a statistical model can hold up to permutations in the data. Model validation is also called model criticism or model evaluation.

This topic is not to be confused with the closely related task of model selection, the process of discriminating between multiple candidate models: model validation does not concern so much the conceptual design of models as it tests only the consistency between a chosen model and its stated outputs.

There are many ways to validate a model. Residual plots plot the difference between the actual data and the model's predictions: correlations in the residual plots may indicate a flaw in the model. Cross validation is a method of model validation that iteratively refits the model, each time leaving out just a small sample and comparing whether the samples left out are predicted by the model: there are many kinds of cross validation. Predictive simulation is used to compare simulated data to actual data. External validation involves fitting the model to new data. Akaike information criterion estimates the quality of a model.
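
As a brief illustration of the Akaike information criterion mentioned above, the following sketch fits two candidate regression models to simulated data with statsmodels and compares their AIC values. The data, variable names, and model choices are purely illustrative assumptions, not a prescribed procedure.

```python
# Minimal sketch: comparing two candidate regression models by AIC
# using statsmodels on simulated data (all names here are illustrative).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=200)   # truly linear data

X_lin = sm.add_constant(x)                                  # intercept + x
X_quad = sm.add_constant(np.column_stack([x, x ** 2]))      # intercept + x + x^2

fit_lin = sm.OLS(y, X_lin).fit()
fit_quad = sm.OLS(y, X_quad).fit()

# Lower AIC indicates a better trade-off between fit and complexity.
print("AIC (linear):   ", fit_lin.aic)
print("AIC (quadratic):", fit_quad.aic)
```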

Overview

Model validation comes in many forms, and the specific method a researcher uses is often constrained by their research design: there is no one-size-fits-all method for validating a model. For example, a researcher operating with a very limited set of data, but data they have strong prior assumptions about, may consider validating the fit of their model within a Bayesian framework, testing the fit under various prior distributions. A researcher with a lot of data who is testing multiple nested models, however, may be better served by cross validation, possibly with a leave-one-out test. These are two abstract examples, and any actual model validation will have to consider far more intricacies than are described here, but they illustrate that model validation methods are always circumstantial.

In general, models can be validated using existing data or with new data; both approaches are discussed in the following subsections, along with a note of caution.

Validation with existing data

Validation based on existing data involves analyzing the goodness of fit of the model or analyzing whether the residuals seem to be random (i.e., residual diagnostics). This method analyzes the model's closeness to the data and tries to understand how well the model predicts its own data. One example of this method is in Figure 1, which shows a polynomial function fit to some data. The polynomial function does not conform well to the data, which appear linear, and this might invalidate the polynomial model.

Commonly, statistical models on existing data are validated using a validation set, which may also be referred to as a holdout set. A validation set is a set of data points that the user leaves out when fitting a statistical model. After the statistical model is fitted, the validation set is used as a measure of the model's error. If the model fits well on the initial data but has a large error on the validation set, this is a sign of overfitting.
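
The sketch below illustrates this holdout idea under simple assumptions: simulated straight-line data, a deliberately over-flexible polynomial model, and scikit-learn utilities. The large gap between training and validation error is the overfitting signal described above; all names and settings are illustrative.

```python
# Minimal sketch of a holdout (validation) set, assuming scikit-learn is available.
# A high-degree polynomial fits the training data closely but shows a much larger
# error on the held-out points -- the overfitting signal described above.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=40).reshape(-1, 1)
y = 1.0 + 3.0 * x.ravel() + rng.normal(scale=0.3, size=40)   # straight line + noise

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=1)

model = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
model.fit(x_train, y_train)

print("training MSE:  ", mean_squared_error(y_train, model.predict(x_train)))
print("validation MSE:", mean_squared_error(y_val, model.predict(x_val)))
```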

Figure 1: Data (black dots), generated from a straight line with added noise, is perfectly fitted by a curvy polynomial.

Validation with new data

If new data becomes available, an existing model can be validated by assessing whether the new data is predicted by the old model. If the new data is not predicted by the old model, then the model might not be valid for the researcher's goals.

With this in mind, a modern approach to validating a neural network is to test its performance on domain-shifted data, which ascertains whether the model has learned domain-invariant features.[1]

A note of caution

A model can be validated only relative to some application area.[2][3] A model that is valid for one application might be invalid for some other applications. As an example, consider the curve in Figure 1: if the application only used inputs from the interval [0, 2], then the curve might well be an acceptable model.

Methods for validating

When doing a validation, there are three notable causes of potential difficulty, according to the Encyclopedia of Statistical Sciences.[4] The three causes are these: lack of data; lack of control of the input variables; uncertainty about the underlying probability distributions and correlations. The usual methods for dealing with difficulties in validation include the following: checking the assumptions made in constructing the model; examining the available data and related model outputs; applying expert judgment.[2] Note that expert judgment commonly requires expertise in the application area.[2]

Expert judgment can sometimes be used to assess the validity of a prediction without obtaining real data: e.g. for the curve in Figure 1, an expert might well be able to assess that a substantial extrapolation will be invalid. Additionally, expert judgment can be used in Turing-type tests, where experts are presented with both real data and related model outputs and then asked to distinguish between the two.[5]

For some classes of statistical models, specialized methods of performing validation are available. As an example, if the statistical model was obtained via a regression, then specialized analyses for regression model validation exist and are generally employed.

Residual diagnostics

Residual diagnostics involve the analysis of a model's residuals to check if they appear effectively random and consistent with the model's assumptions. For regression analysis, this is a critical step in assessing the goodness-of-fit and reliability of the model.

The core assumptions for a good regression model include the following (a short code sketch for checking them appears after this list):

  • Zero mean: The residuals should be centered around zero, indicating no systematic bias in the predictions.
  • Constant variance (homoscedasticity): The spread of the residuals should be consistent across all predicted values.
  • Independence: Each residual should be independent of the others. For time series data, this means there should be no autocorrelation (serial correlation).
  • Normality: For making statistical inferences, the residuals are assumed to be approximately normally distributed.[6]
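
As referenced above, the following sketch shows one way these assumptions might be checked with standard statsmodels and SciPy tests. The simulated data and the specific tests chosen (Breusch-Pagan, Durbin-Watson, Shapiro-Wilk) are illustrative assumptions rather than a prescribed procedure.

```python
# Minimal sketch of formal checks for the assumptions listed above, applied to a
# fitted statsmodels OLS model (data and variable names are illustrative).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=150)
y = 1.5 + 0.8 * x + rng.normal(scale=1.0, size=150)
X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
resid = fit.resid

print("mean of residuals:   ", resid.mean())                   # zero-mean check
print("Breusch-Pagan p-value:", het_breuschpagan(resid, X)[1])  # constant variance
print("Durbin-Watson:        ", durbin_watson(resid))           # independence (~2 is good)
print("Shapiro-Wilk p-value: ", stats.shapiro(resid)[1])        # normality
```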

Common diagnostic plots for regression

Visualizing residuals is one of the most powerful diagnostic techniques. Statistical software like R and Python can generate these plots automatically after fitting a regression model.

1. Residuals vs. Fitted Values plot

This scatter plot displays the residuals on the vertical axis and the model's fitted (predicted) values on the horizontal axis.

  • Ideal pattern: A horizontal band of points randomly scattered around the zero line indicates a good fit. This suggests the variance is constant and there are no underlying non-linear patterns left unexplained by the model.
  • Curved pattern: A non-linear shape, such as a U-shape, indicates that the relationship between the variables is not linear. This suggests that a non-linear term (e.g., a quadratic term) or a data transformation may be necessary.
  • Fanning or cone-shaped pattern: A pattern where the spread of the residuals widens or narrows as the fitted values increase indicates non-constant variance (heteroscedasticity).[7]

2. Normal Q-Q (Quantile-Quantile) plot

This plot compares the distribution of the standardized residuals to the theoretical normal distribution.

  • Ideal pattern: If the residuals are normally distributed, the points will fall closely along a straight diagonal line.
  • Deviations from the line: S-shaped curves, inverted S-shapes, or heavy tails indicate that the residuals are not normally distributed, which can affect the validity of statistical tests.[8]

3. Scale-Location plot

This plot is similar to the residuals vs. fitted plot but uses the square root of the standardized residuals on the vertical axis to help visualize non-constant variance. The plot checks the assumption of equal variance (homoscedasticity) more clearly.

  • Ideal pattern: A horizontal line with points scattered randomly around it.
  • Patterned deviations: Non-horizontal lines or non-random scattering indicate a violation of the constant variance assumption.[9]

4. Residuals vs. Leverage plot

This plot helps identify influential cases that have a disproportionate impact on the regression model. Leverage measures how far an observation's independent variable values are from the average. The plot also includes contours for Cook's distance, a measure of influence.

  • Influential points: Data points with high leverage and high residuals (large Cook's distance) can dramatically change the slope of the regression line if removed. These points appear outside the dashed contour lines on the plot.

Steps for performing a residual diagnostic

  1. Fit the model: Run your regression analysis to generate predicted values and calculate the residuals (observed value minus predicted value).
  2. Generate plots: Create the set of standard diagnostic plots, including Residuals vs. Fitted, Normal Q-Q, Scale-Location, and Residuals vs. Leverage (a code sketch for building these plots follows the steps below).
  3. Inspect the plots: Systematically examine each plot for patterns that violate the model's assumptions. Use the "ideal patterns" described above as a benchmark.
  4. Test for autocorrelation (for time series): Plot the autocorrelation function (ACF) of the residuals to check for any significant correlations over time. You can also perform a formal Ljung-Box test.
  5. Address any issues: If the diagnostics reveal problems, consider the following actions:
    • Add non-linear terms: If you see a curved pattern, add quadratic or cubic terms to the model.
    • Transform the variables: Apply transformations (e.g., log, square root) to stabilize variance or linearize the relationships.
    • Investigate outliers: Check for data entry errors or unusual events that caused a point to be an outlier. Consider removing or adjusting it if justified.
    • Use weighted least squares: For heteroscedasticity, a method that gives less weight to observations with higher variance may improve the model.
  6. Re-run diagnostics: After making adjustments, repeat the diagnostic process to confirm that the changes have resolved the issues.
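
As a rough illustration of steps 1-3, the sketch below builds the four diagnostic plots manually from a statsmodels fit using matplotlib. The simulated data and plotting choices are illustrative assumptions; statistical packages often provide equivalent plots directly.

```python
# Minimal sketch of the four diagnostic plots described above, built manually with
# matplotlib from a fitted statsmodels OLS result (names are illustrative).
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.6 * x + rng.normal(scale=1.0, size=100)
fit = sm.OLS(y, sm.add_constant(x)).fit()

influence = fit.get_influence()
std_resid = influence.resid_studentized_internal   # standardized residuals
leverage = influence.hat_matrix_diag
cooks_d = influence.cooks_distance[0]

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# 1. Residuals vs. fitted values: look for a flat, patternless band around zero.
axes[0, 0].scatter(fit.fittedvalues, fit.resid)
axes[0, 0].axhline(0, linestyle="--")
axes[0, 0].set(title="Residuals vs Fitted", xlabel="Fitted values", ylabel="Residuals")

# 2. Normal Q-Q plot of the standardized residuals.
sm.qqplot(std_resid, line="45", ax=axes[0, 1])
axes[0, 1].set(title="Normal Q-Q")

# 3. Scale-Location: sqrt(|standardized residuals|) vs fitted values.
axes[1, 0].scatter(fit.fittedvalues, np.sqrt(np.abs(std_resid)))
axes[1, 0].set(title="Scale-Location", xlabel="Fitted values",
               ylabel="sqrt(|standardized residuals|)")

# 4. Residuals vs. leverage, point size proportional to Cook's distance.
axes[1, 1].scatter(leverage, std_resid, s=200 * cooks_d + 5)
axes[1, 1].set(title="Residuals vs Leverage", xlabel="Leverage",
               ylabel="Standardized residuals")

fig.tight_layout()
plt.show()
```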

Cross validation

Cross validation is a resampling method that leaves part of the data out of the fitting process and then checks whether the left-out data are close to or far from the model's predictions for them. In practice, cross-validation techniques fit the model many times, each time on a portion of the data, and compare each fit against the portion that was not used. If the fitted models rarely describe the data they were not trained on, the model is probably wrong.
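
A minimal sketch of k-fold cross-validation, assuming scikit-learn and simulated data (both illustrative), is shown below; each fold is held out once, and the averaged held-out error estimates how well the model describes data it was not fitted to.

```python
# Minimal sketch of k-fold cross-validation with scikit-learn (illustrative names).
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(120, 1))
y = 1.0 + 2.0 * X.ravel() + rng.normal(scale=1.0, size=120)

cv = KFold(n_splits=5, shuffle=True, random_state=4)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print("cross-validated MSE per fold:", -scores)
print("mean cross-validated MSE:    ", -scores.mean())
```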

A recent development in medical statistics is the use of a cross-validation metric in meta-analysis. It forms the basis of the validation statistic, Vn, which is used to test the statistical validity of meta-analysis summary estimates.[10] It has also been used in a more conventional sense in meta-analysis to estimate the likely prediction error of meta-analysis results.[11]


from Grokipedia
Statistical model validation is the process of assessing the degree to which a statistical model accurately represents real-world phenomena for its intended use by comparing model predictions to experimental or observed data, while quantifying and accounting for uncertainties in both the model and the data. This evaluation determines the model's predictive capability and reliability, often through hypothesis testing or probabilistic metrics, to ensure it is not merely overfitted to training data but generalizes effectively to new scenarios. Across many fields, statistical model validation is essential for building confidence in model outputs that inform decision-making and policy. For instance, in engineering applications it addresses uncertainties from physical parameters, modeling assumptions, and statistical variability to validate models like those used in heat conduction or shock physics simulations. In disease modeling, validation ensures models credibly predict outcomes like disease progression or treatment effects by confronting them with observed data, thereby supporting technology assessments and clinical applications. The process typically distinguishes between calibration (adjusting model parameters to fit data) and validation, which avoids such adjustments to test independent performance.

Key methods in statistical model validation include internal verification to check implementation accuracy, external validation using held-out data for predictive testing, and expert judgment of model plausibility. Formal statistical techniques encompass goodness-of-fit tests, cross-validation (e.g., k-fold methods to estimate error rates and prevent overfitting), and permutation tests to assess significance against chance correlations, which are particularly useful in high-dimensional settings. Advanced approaches involve Bayesian hypothesis testing, which incorporates prior beliefs to compute Bayes factors for model acceptance, and Monte Carlo simulations to propagate uncertainties and evaluate probabilistic consistency between predictions and observations. These methods often rely on metrics such as the area metric or maximum likelihood estimators to quantify discrepancies, with challenges arising from limited data, multivariate correlations, and the need to balance type I and type II errors in testing.

Introduction

Definition and objectives

Statistical model validation is the process of assessing whether a statistical model accurately represents the underlying data-generating mechanism and performs reliably on unseen data. This involves evaluating the model's fidelity to the true relationships in the data while guarding against biases such as overfitting, where the model captures noise rather than genuine patterns. The primary aim is to ensure that the model not only fits the observed data but also generalizes to new observations, thereby providing a trustworthy basis for inference and prediction.

The key objectives of statistical model validation include confirming the validity of underlying model assumptions, detecting issues like overfitting or underfitting, estimating the model's prediction error, and informing refinements to improve performance. By systematically checking these aspects, validation helps quantify the model's accuracy and adequacy for specific quantities of interest, reducing the risk of erroneous conclusions in scientific or applied contexts. Ultimately, these goals support robust decision-making by bridging the gap between model estimation and real-world applicability.

A fundamental distinction exists between model development, which involves building and fitting the model to training data, and model validation, which independently challenges it for soundness, performance, and limitations. This separation is crucial for scientific inference and decision-making, as it prevents over-optimistic assessments derived solely from the development process and ensures the model contributes meaningfully to hypothesis testing or practical predictions. Residuals, as differences between observed and predicted values, serve as basic indicators of model fit in this evaluation. For instance, in regression analysis, validation verifies whether the model captures true relationships between variables without incorporating spurious correlations that arise from sampling variability.

Historical development

The roots of statistical model validation trace back to the early 19th century with the invention of the method of least squares, developed by Carl Friedrich Gauss around 1795 and published in 1809, and independently by Adrien-Marie Legendre in 1805. This approach minimized the sum of squared differences between observed and predicted values, known as residuals, providing an initial mechanism for assessing model fit through residual examination, particularly in astronomical applications like orbit prediction. Gauss further justified the method probabilistically in 1809, assuming normally distributed errors, which laid foundational principles for evaluating model adequacy via error analysis.

In the 20th century, advancements accelerated with Ronald A. Fisher's development of analysis of variance (ANOVA) in the 1920s, introduced in his 1925 book Statistical Methods for Research Workers. ANOVA provided a formal framework for partitioning variance to test model assumptions and detect deviations, marking a shift toward systematic validation in experimental design and agricultural research. This built on Fisher's earlier 1922 work establishing modern theoretical statistics, emphasizing model specification and goodness-of-fit testing for parametric models.

Key milestones in the mid-to-late 20th century included Hirotugu Akaike's 1973 introduction of the Akaike information criterion (AIC) for model selection, balancing goodness of fit with complexity to aid validation. In 1974, Mervyn Stone formalized cross-validation as a predictive assessment tool, enabling evaluation of model performance on unseen data. Bradley Efron's 1979 bootstrap method further revolutionized resampling-based validation, allowing estimation of sampling distributions without parametric assumptions. Influential contributions continued with John W. Tukey's 1977 book Exploratory Data Analysis, which promoted graphical residual inspection and robust diagnostics to uncover model inadequacies. The 1990s saw validation techniques gain prominence in machine learning, driven by the need to address overfitting in complex models, culminating in formalized frameworks like those in Hastie, Tibshirani, and Friedman's 2001 The Elements of Statistical Learning.

Fundamentals of Model Validation

Assumptions in statistical models

Statistical models rely on a set of underlying assumptions that must hold for the validity of estimates, tests, and predictions. In linear regression models, these include linearity, which posits that the relationship between predictors and the response variable is linear; independence of errors, meaning observations are not correlated; homoscedasticity, or constant variance of errors across levels of the predictors; and normality of errors, assuming residuals follow a Gaussian distribution. The assumptions of linearity, independence of errors, and homoscedasticity ensure that ordinary least squares (OLS) estimators are unbiased and efficient (best linear unbiased estimators, BLUE) under the Gauss-Markov theorem, while the normality assumption is required for exact finite-sample inference such as t-tests and F-tests.

For time series models, key assumptions center on stationarity, where the mean, variance, and autocorrelation structure remain constant over time, and the absence of autocorrelation in errors to avoid spurious regressions. Violation of stationarity can lead to unreliable forecasts, as non-stationary series may exhibit trends that invalidate model inferences. Parametric models, such as those assuming a Gaussian error distribution, impose strong distributional assumptions to simplify estimation and inference, enabling the use of maximum likelihood methods. In contrast, non-parametric models make fewer assumptions about the underlying distribution, offering greater flexibility but often requiring larger sample sizes for reliable performance.

Violating these assumptions can result in biased parameter estimates, invalid inference, and diminished predictive accuracy. For instance, heteroscedasticity leads to inefficient standard errors in OLS, inflating type I error rates in tests despite unbiased point estimates. Similarly, unmodeled non-linearity produces misleading confidence intervals and p-values. Validation of these assumptions is a prerequisite before applying inferential procedures, such as t-tests for coefficients or construction of confidence intervals, to ensure the reliability of conclusions. Residuals are commonly examined to assess adherence to these assumptions, though detailed techniques fall under broader diagnostic methods.
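
The time-series assumptions mentioned above can be probed with standard tests; the sketch below applies an augmented Dickey-Fuller test for stationarity and a Ljung-Box test for autocorrelation to simulated series, purely as an illustration of the mechanics.

```python
# Minimal sketch of checks for the time-series assumptions mentioned above:
# an augmented Dickey-Fuller test for stationarity and a Ljung-Box test for
# autocorrelation (all data here are simulated and illustrative).
import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(5)
stationary = rng.normal(size=300)               # white noise: stationary
random_walk = np.cumsum(rng.normal(size=300))   # random walk: non-stationary

print("ADF p-value, white noise:", adfuller(stationary)[1])    # small => stationary
print("ADF p-value, random walk:", adfuller(random_walk)[1])   # large => unit root suspected

# Ljung-Box test applied to (here, simulated) residuals at lag 10.
print(acorr_ljungbox(stationary, lags=[10]))
```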

Sources of model inadequacy

Statistical models can fail to adequately represent the underlying data-generating process due to overfitting, where the model captures idiosyncratic noise in the training data rather than the true signal, resulting in high variance and poor performance on new data. This occurs particularly when model complexity exceeds what is necessary, leading to inflated in-sample fit but diminished generalization ability. In contrast, underfitting arises when the model is overly simplistic, failing to capture essential patterns in the data and producing high bias in predictions. This inadequacy often stems from insufficient model flexibility or inadequate feature representation, causing systematic errors across both training and test datasets.

Specification errors represent another major source of model inadequacy, including incorrect functional forms that misrepresent nonlinear relationships as linear, omitted variables that correlate with included predictors and induce bias in estimates, and multicollinearity among predictors that destabilizes parameter inference despite unbiased estimates. Omitted-variable bias, for instance, leads to inconsistent ordinary least squares estimators when relevant confounders are excluded. Multicollinearity exacerbates this by inflating standard errors, making it difficult to discern individual predictor effects reliably.

Data-related issues further contribute to model inadequacy, such as outliers that disproportionately influence estimates and introduce bias in measures like means or regression slopes. Missing values can similarly cause systematic bias if the missingness mechanism is not random, leading to non-representative samples that distort the model's understanding of the population. Non-representative sampling, where the sample fails to reflect the target distribution, compounds these problems by embedding selection biases into the model. For example, in linear regression, including irrelevant predictors can artificially inflate the in-sample R-squared metric while severely degrading out-of-sample predictive accuracy due to overfitting. Cross-validation procedures can help quantify such overfitting by evaluating performance on held-out data.
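
The R-squared example above can be made concrete with a small simulation: adding irrelevant predictors inflates in-sample R-squared while out-of-sample error typically worsens. The sketch below, using statsmodels and simulated data, is illustrative only.

```python
# Minimal sketch of the overfitting example above: adding irrelevant predictors
# inflates in-sample R-squared while out-of-sample error typically worsens.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n_train, n_test = 60, 200
x = rng.normal(size=n_train)
y = 1.0 + 2.0 * x + rng.normal(size=n_train)
noise_cols = rng.normal(size=(n_train, 30))        # 30 irrelevant predictors

x_new = rng.normal(size=n_test)
y_new = 1.0 + 2.0 * x_new + rng.normal(size=n_test)
noise_new = rng.normal(size=(n_test, 30))

fit_small = sm.OLS(y, sm.add_constant(x)).fit()
fit_big = sm.OLS(y, sm.add_constant(np.column_stack([x, noise_cols]))).fit()

pred_small = fit_small.predict(sm.add_constant(x_new))
pred_big = fit_big.predict(sm.add_constant(np.column_stack([x_new, noise_new])))

print("in-sample R^2, 1 predictor:      ", fit_small.rsquared)
print("in-sample R^2, 31 predictors:    ", fit_big.rsquared)    # inflated
print("out-of-sample MSE, 1 predictor:  ", np.mean((y_new - pred_small) ** 2))
print("out-of-sample MSE, 31 predictors:", np.mean((y_new - pred_big) ** 2))
```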

Validation Strategies

Internal validation approaches

Internal validation approaches involve assessing a statistical model's performance using the same dataset that was used to fit the model, providing a preliminary evaluation of fit quality without requiring additional data. These methods are particularly useful in resource-constrained settings where external data are unavailable, allowing researchers to quickly identify potential issues in model assumptions or specification. However, they inherently carry a risk of over-optimism because the model is tuned to the specific quirks of the training data.

One common internal validation technique is the holdout method, which splits the original dataset into a training subset for model fitting and a separate test subset for evaluation. Typically, the split is done randomly, with proportions such as 70% for training and 30% for testing, to simulate performance on unseen data within the sample. This approach estimates predictive accuracy, such as mean squared error or classification error rate, but its reliability depends on the dataset size; smaller samples can lead to high variance in estimates. The holdout method is simple and computationally efficient but may not fully represent the model's generalizability if the split is not representative.

Residual-based checks represent another key internal validation strategy, focusing on the differences between observed and fitted values to diagnose model adequacy without new data. These checks often involve plotting residuals, such as standardized or studentized residuals, against predictors or fitted values to detect patterns like heteroscedasticity, non-linearity, or outliers that suggest model misspecification. Formal tests on residuals, including the Durbin-Watson test for autocorrelation or the Breusch-Pagan test for homoscedasticity, can quantify deviations from assumptions. In generalized linear models, deviance residuals are particularly useful, as they are derived from the model's likelihood and help identify poorly fitted observations by measuring contributions to the overall deviance.

In-sample metrics provide quantitative measures of fit directly from the fitted model. The coefficient of determination, or R-squared, quantifies the proportion of variance in the response variable explained by the model, ranging from 0 to 1, while the adjusted R-squared penalizes for the number of predictors to avoid inflation from unnecessary variables. F-tests assess the overall significance of the model by comparing it to a null model with no predictors, evaluating whether the included terms collectively improve fit beyond chance. These metrics are straightforward to compute but primarily reflect in-sample performance rather than predictive ability.

Despite their utility, internal validation approaches have notable limitations, including the potential for optimistic bias since the model is optimized on the same data used for assessment, leading to inflated performance estimates that may not hold in new samples. This is exacerbated in small datasets, where random splits or residual patterns can be unstable, making internal methods suitable primarily for large samples with ample data for reliable splitting. In contrast, external validation using independent data is preferred for unbiased estimates of generalizability.

For example, in logistic regression models for binary outcomes, internal validation can employ deviance residuals to check calibration, where plots of residuals against predicted probabilities reveal discrepancies between predicted and observed event rates. If the residuals show systematic patterns, such as funnel shapes indicating poor calibration at extreme probabilities, it signals the need for model refinement, such as adding interactions or transforming predictors. This approach leverages the full dataset for both fitting and diagnosis, offering insights into model reliability within the study's context.
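
The calibration check described in this example can be sketched with scikit-learn's calibration_curve, which bins predicted probabilities and compares them with observed event rates; the simulated data and model below are illustrative assumptions.

```python
# Minimal sketch of an internal calibration check for a logistic regression model,
# comparing binned predicted probabilities with observed event rates
# (scikit-learn's calibration_curve; data and names are illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 3))
logits = 0.5 + X @ np.array([1.0, -0.8, 0.3])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]

observed, predicted = calibration_curve(y, probs, n_bins=10)
for p, o in zip(predicted, observed):
    print(f"mean predicted {p:.2f}  observed rate {o:.2f}")
```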

External validation approaches

External validation approaches involve evaluating a statistical model's performance on independent datasets that were not used during model development or training. Model development builds the model, whereas external validation independently challenges it for soundness, performance, and limitations, thereby assessing its generalizability to new, unseen data. This method contrasts with internal validation by providing a more realistic estimate of how the model will perform in future applications, as it accounts for variations in data sources, populations, or temporal conditions.

Prospective validation represents a rigorous form of external validation, where new data are prospectively collected after the model has been finalized and then used to test its predictive accuracy. This approach simulates real-world deployment by ensuring the validation data are truly independent and reflective of ongoing conditions, often involving separate studies or cohorts. For instance, in prognostic modeling, prospective validation confirms the model's performance in novel patient groups by directly comparing predicted risks to observed outcomes.

In scenarios involving longitudinal or time-series data, temporal splits serve as a key external validation technique, where the dataset is divided based on time periods to create holdout sets that mimic sequential real-world use. The model is trained on earlier data (e.g., 2010–2015) and validated on later data (e.g., 2016–2020), preserving the temporal order to avoid lookahead bias. This method is particularly useful for detecting performance degradation over time due to evolving data patterns.

Out-of-sample prediction assesses generalizability through metrics applied to these independent datasets, focusing on how well the model forecasts outcomes beyond the training data. Common metrics include mean squared error (MSE) for continuous outcomes, calculated as the average of squared differences between predicted and observed values, where lower values indicate better accuracy; and the area under the receiver operating characteristic curve (AUC-ROC) for binary outcomes, measuring discrimination, with values closer to 1 signifying superior performance (e.g., 0.8 is considered good). These metrics provide quantifiable evidence of the model's performance in novel settings.

The primary advantages of external validation include delivering an unbiased estimate of future performance, free from the optimism inherent in internal methods, and identifying issues such as concept drift, where the underlying data distribution changes over time or across populations, leading to reduced model efficacy. By exposing these limitations early, external approaches enhance model reliability and prevent misguided applications in practice.

A representative example occurs in clinical trials, where a predictive model for patient outcomes is validated on a separate cohort to confirm its efficacy before widespread adoption. For instance, the Kidney Failure Risk Equation, developed for chronic kidney disease (CKD) progression, has been externally validated across diverse CKD cohorts, demonstrating robust discrimination (C-index of 0.88–0.90) and aiding in referral decisions.
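
A temporal split of the kind described above might look like the following sketch, which trains on earlier (simulated) records and reports out-of-time AUC on later ones; the column names, cutoff year, and model are all illustrative assumptions.

```python
# Minimal sketch of a temporal (out-of-time) validation split: fit on earlier
# records, evaluate discrimination (AUC) on later ones (illustrative data).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(8)
n = 2000
df = pd.DataFrame({
    "year": rng.integers(2010, 2021, size=n),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
logits = -0.2 + 0.9 * df["x1"] - 0.6 * df["x2"]
df["outcome"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

train = df[df["year"] <= 2015]   # development period
test = df[df["year"] > 2015]     # later, "unseen" period

model = LogisticRegression().fit(train[["x1", "x2"]], train["outcome"])
probs = model.predict_proba(test[["x1", "x2"]])[:, 1]
print("out-of-time AUC:", roc_auc_score(test["outcome"], probs))
```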

Core Validation Methods

Residual analysis techniques

Residual analysis is a fundamental technique in statistical model validation that involves examining the differences between observed and predicted values to assess model adequacy. Residuals, denoted $e_i = y_i - \hat{y}_i$ for the $i$-th observation, where $y_i$ is the observed response and $\hat{y}_i$ is the fitted value, provide insights into how well the model captures the underlying structure of the data. Ordinary residuals are these raw differences, while standardized residuals scale them by the estimated standard deviation of the residual, given by $r_i = \frac{e_i}{\hat{\sigma} \sqrt{1 - h_{ii}}}$, where $\hat{\sigma}$ is the estimated error standard deviation and $h_{ii}$ is the leverage, the $i$-th diagonal element of the hat matrix.
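
As a small check of the definitions above, the following sketch computes standardized residuals directly from the formula, using leverages from the hat matrix diagonal of a statsmodels fit, and compares them with the library's internally studentized residuals; the data are simulated and illustrative.

```python
# Minimal sketch computing the standardized residuals defined above from a
# fitted statsmodels OLS model; leverages h_ii come from the hat matrix diagonal.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
x = rng.uniform(0, 5, size=50)
y = 1.0 + 1.2 * x + rng.normal(scale=0.5, size=50)
fit = sm.OLS(y, sm.add_constant(x)).fit()

influence = fit.get_influence()
h = influence.hat_matrix_diag                    # leverages h_ii
sigma_hat = np.sqrt(fit.mse_resid)               # estimate of sigma
r = fit.resid / (sigma_hat * np.sqrt(1.0 - h))   # r_i = e_i / (sigma_hat * sqrt(1 - h_ii))

# Should match statsmodels' internally studentized residuals.
print(np.allclose(r, influence.resid_studentized_internal))
```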