Statistical model
from Wikipedia

A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, the data-generating process.[1] When referring specifically to probabilities, the corresponding term is probabilistic model. All statistical hypothesis tests and all statistical estimators are derived via statistical models. More generally, statistical models are part of the foundation of statistical inference. A statistical model is usually specified as a mathematical relationship between one or more random variables and other non-random variables. As such, a statistical model is "a formal representation of a theory" (Herman Adèr quoting Kenneth Bollen).[2]

Introduction

Informally, a statistical model can be thought of as a statistical assumption (or set of statistical assumptions) with a certain property: that the assumption allows us to calculate the probability of any event. As an example, consider a pair of ordinary six-sided dice. We will study two different statistical assumptions about the dice.

The first statistical assumption is this: for each of the dice, the probability of each face (1, 2, 3, 4, 5, and 6) coming up is 1/6. From that assumption, we can calculate the probability of both dice coming up 5:  1/6 × 1/6 = 1/36.  More generally, we can calculate the probability of any event: e.g. (1 and 2) or (3 and 3) or (5 and 6). The alternative statistical assumption is this: for each of the dice, the probability of the face 5 coming up is 1/8 (because the dice are weighted). From that assumption, we can calculate the probability of both dice coming up 5:  1/8 × 1/8 = 1/64.  We cannot, however, calculate the probability of any other nontrivial event, as the probabilities of the other faces are unknown.

The first statistical assumption constitutes a statistical model: because with the assumption alone, we can calculate the probability of any event. The alternative statistical assumption does not constitute a statistical model: because with the assumption alone, we cannot calculate the probability of every event. In the example above, with the first assumption, calculating the probability of an event is easy. With some other examples, though, the calculation can be difficult, or even impractical (e.g. it might require millions of years of computation). For an assumption to constitute a statistical model, such difficulty is acceptable: doing the calculation does not need to be practicable, just theoretically possible.
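
The first assumption can be turned into a small computation. The following is a minimal sketch in Python, assuming the two dice are independent and treating each outcome as an ordered pair (first die, second die); the names `fair_die` and `event_probability` are illustrative and do not come from the source.

```python
from fractions import Fraction
from itertools import product

# First assumption: every face of each die has probability 1/6.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

def event_probability(event, die=fair_die):
    """Probability that a roll of two independent dice lands in `event`.

    `event` is a set of ordered (first, second) outcomes.
    """
    return sum(
        die[a] * die[b]
        for a, b in product(die, repeat=2)
        if (a, b) in event
    )

both_five = {(5, 5)}
mixed = {(1, 2), (3, 3), (5, 6)}

print(event_probability(both_five))  # 1/36
print(event_probability(mixed))      # 1/12
```

Under the alternative assumption only the probability of face 5 is known, so the dictionary of face probabilities could not be filled in and the same function could not be evaluated for most events.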

Formal definition

In mathematical terms, a statistical model is a pair $(S, \mathcal{P})$, where $S$ is the set of possible observations, i.e. the sample space, and $\mathcal{P}$ is a set of probability distributions on $S$.[3] The set $\mathcal{P}$ represents all of the models that are considered possible. This set is typically parameterized: $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. The set $\Theta$ defines the parameters of the model. If a parameterization is such that distinct parameter values give rise to distinct distributions, i.e. $P_{\theta_1} = P_{\theta_2} \Rightarrow \theta_1 = \theta_2$ (in other words, the mapping $\theta \mapsto P_\theta$ is injective), it is said to be identifiable.[3]

In some cases, the model can be more complex.

  • In Bayesian statistics, the model is extended by adding a probability distribution over the parameter space $\Theta$.
  • A statistical model can sometimes distinguish two sets of probability distributions. The first set is the set of models considered for inference. The second set is the set of models that could have generated the data, which is much larger than the first set. Such statistical models are key in checking that a given procedure is robust, i.e. that it does not produce catastrophic errors when its assumptions about the data are incorrect.

An example

Suppose that we have a population of children, with the ages of the children distributed uniformly in the population. The height of a child will be stochastically related to the age: e.g. when we know that a child is of age 7, this influences the chance of the child being 1.5 meters tall. We could formalize that relationship in a linear regression model, like this: $\text{height}_i = b_0 + b_1 \text{age}_i + \varepsilon_i$, where $b_0$ is the intercept, $b_1$ is a parameter that age is multiplied by to obtain a prediction of height, $\varepsilon_i$ is the error term, and $i$ identifies the child. This implies that height is predicted by age, with some error.

An admissible model must be consistent with all the data points. Thus, a straight line ($\text{height}_i = b_0 + b_1 \text{age}_i$) cannot be admissible for a model of the data, unless it exactly fits all the data points, i.e. all the data points lie perfectly on the line. The error term, $\varepsilon_i$, must be included in the equation, so that the model is consistent with all the data points. To do statistical inference, we would first need to assume some probability distributions for the $\varepsilon_i$. For instance, we might assume that the $\varepsilon_i$ distributions are i.i.d. Gaussian, with zero mean. In this instance, the model would have 3 parameters: $b_0$, $b_1$, and the variance of the Gaussian distribution. We can formally specify the model in the form $(S, \mathcal{P})$ as follows. The sample space, $S$, of our model comprises the set of all possible pairs (age, height). Each possible value of $\theta = (b_0, b_1, \sigma^2)$ determines a distribution on $S$; denote that distribution by $P_\theta$. If $\Theta$ is the set of all possible values of $\theta$, then $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. (The parameterization is identifiable, and this is easy to check.)

In this example, the model is determined by (1) specifying $S$ and (2) making some assumptions relevant to $\mathcal{P}$. There are two assumptions: that height can be approximated by a linear function of age; that errors in the approximation are distributed as i.i.d. Gaussian. The assumptions are sufficient to specify $\mathcal{P}$, as they are required to do.
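
To make the height example concrete, the sketch below simulates data from the model and recovers the three parameters by ordinary least squares. It is only an illustration: the true parameter values, the age range of 2 to 12 years, and the function names `simulate_heights` and `fit_line` are assumptions chosen for the example, not values from the source.

```python
import random

def simulate_heights(n, b0, b1, sigma, seed=0):
    """Draw (age, height) pairs from height = b0 + b1 * age + Gaussian error."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        age = rng.uniform(2.0, 12.0)  # ages uniform in the population (assumed range)
        height = b0 + b1 * age + rng.gauss(0.0, sigma)
        data.append((age, height))
    return data

def fit_line(data):
    """Least-squares estimates of b0 and b1, plus the ML estimate of the error variance."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    rss = sum((y - b0 - b1 * x) ** 2 for x, y in data)
    return b0, b1, rss / n

# Hypothetical true parameters, heights in meters.
data = simulate_heights(n=2000, b0=0.75, b1=0.06, sigma=0.05)
b0_hat, b1_hat, sigma2_hat = fit_line(data)
print(b0_hat, b1_hat, sigma2_hat)
```

With a few thousand simulated children the estimates should land close to the values used in the simulation, which is what identifiability of the parameterization leads one to expect.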

General remarks

A statistical model is a special class of mathematical model. What distinguishes a statistical model from other mathematical models is that a statistical model is non-deterministic. Thus, in a statistical model specified via mathematical equations, some of the variables do not have specific values, but instead have probability distributions; i.e. some of the variables are stochastic. In the above example with children's heights, ε is a stochastic variable; without that stochastic variable, the model would be deterministic. Statistical models are often used even when the data-generating process being modeled is deterministic. For instance, coin tossing is, in principle, a deterministic process; yet it is commonly modeled as stochastic (via a Bernoulli process). Choosing an appropriate statistical model to represent a given data-generating process is sometimes extremely difficult, and may require knowledge of both the process and relevant statistical analyses. Relatedly, the statistician Sir David Cox has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis".[4]

There are three purposes for a statistical model, according to Konishi & Kitagawa:[5]

  1. Predictions
  2. Extraction of information
  3. Description of stochastic structures

Those three purposes are essentially the same as the three purposes indicated by Friendly & Meyer: prediction, estimation, description.[6]

Dimension of a model

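Suppose that we have a statistical model $(S, \mathcal{P})$ with $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. In notation, we write that $\Theta \subseteq \mathbb{R}^k$ where $k$ is a positive integer ($\mathbb{R}$ denotes the real numbers; other sets can be used, in principle). Here, $k$ is called the dimension of the model. The model is said to be parametric if $\Theta$ has finite dimension.[citation needed] As an example, if we assume that data arise from a univariate Gaussian distribution, then we are assuming that

$$ \mathcal{P} = \left\{ P_{\mu,\sigma}(x) \equiv \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) : \mu \in \mathbb{R},\ \sigma > 0 \right\}. $$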

In this example, the dimension, k, equals 2. As another example, suppose that the data consists of points (x, y) that we assume are distributed according to a straight line with i.i.d. Gaussian residuals (with zero mean): this leads to the same statistical model as was used in the example with children's heights. The dimension of the statistical model is 3: the intercept of the line, the slope of the line, and the variance of the distribution of the residuals. (Note the set of all possible lines has dimension 2, even though geometrically, a line has dimension 1.)

Although formally $\theta \in \Theta$ is a single parameter that has dimension $k$, it is sometimes regarded as comprising $k$ separate parameters. For example, with the univariate Gaussian distribution, $\theta$ is formally a single parameter with dimension 2, but it is often regarded as comprising 2 separate parameters: the mean and the standard deviation. A statistical model is nonparametric if the parameter set $\Theta$ is infinite dimensional. A statistical model is semiparametric if it has both finite-dimensional and infinite-dimensional parameters. Formally, if $k$ is the dimension of $\Theta$ and $n$ is the number of samples, both semiparametric and nonparametric models have $k \to \infty$ as $n \to \infty$. If $k/n \to 0$ as $n \to \infty$, then the model is semiparametric; otherwise, the model is nonparametric.

Parametric models are by far the most commonly used statistical models. Regarding semiparametric and nonparametric models, Sir David Cox has said, "These typically involve fewer assumptions of structure and distributional form but usually contain strong assumptions about independencies".[7]

Nested models

Two statistical models are nested if the first model can be transformed into the second model by imposing constraints on the parameters of the first model. As an example, the set of all Gaussian distributions has, nested within it, the set of zero-mean Gaussian distributions: we constrain the mean in the set of all Gaussian distributions to get the zero-mean distributions. As a second example, the quadratic model

$$ y = b_0 + b_1 x + b_2 x^2 + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2) $$

has, nested within it, the linear model

$$ y = b_0 + b_1 x + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2) $$

—we constrain the parameter b2 to equal 0.

In both those examples, the first model has a higher dimension than the second model (for the first example, the zero-mean model has dimension 1). Such is often, but not always, the case. As an example where they have the same dimension, the set of positive-mean Gaussian distributions is nested within the set of all Gaussian distributions; they both have dimension 2.
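
One practical consequence of nesting is that the best fit within the restricted model can never beat the best fit within the larger model. The sketch below illustrates this for the Gaussian example, using the closed-form maximum-likelihood estimates; the simulated data and the function names are assumptions made for the illustration.

```python
import math
import random

def gaussian_loglik(xs, mu, var):
    """Log-likelihood of i.i.d. Gaussian data with mean mu and variance var."""
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * var) - squared_error / (2 * var)

def fit_full(xs):
    """ML fit over all Gaussians: both mean and variance are free (dimension 2)."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

def fit_zero_mean(xs):
    """ML fit over zero-mean Gaussians: only the variance is free (dimension 1)."""
    var = sum(x * x for x in xs) / len(xs)
    return 0.0, var

rng = random.Random(1)
xs = [rng.gauss(0.3, 1.0) for _ in range(500)]  # hypothetical sample

ll_full = gaussian_loglik(xs, *fit_full(xs))
ll_restricted = gaussian_loglik(xs, *fit_zero_mean(xs))
print(ll_full >= ll_restricted)  # True
```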

Comparing models

Comparing statistical models is fundamental for much of statistical inference. Konishi & Kitagawa (2008, p. 75) state: "The majority of the problems in statistical inference can be considered to be problems related to statistical modeling. They are typically formulated as comparisons of several statistical models." Common criteria for comparing models include the following: $R^2$, Bayes factor, Akaike information criterion, and the likelihood-ratio test together with its generalization, the relative likelihood.
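
As a small illustration of one of these criteria, the sketch below computes the Akaike information criterion, $\mathrm{AIC} = 2k - 2\ln\hat{L}$, for two models of coin tosses: a fair coin with no free parameter and a coin with a free success probability. The data (62 heads in 100 tosses) are hypothetical and the function names are illustrative.

```python
import math

def bernoulli_loglik(successes, n, p):
    """Log-likelihood of `successes` heads in `n` independent tosses (requires 0 < p < 1)."""
    return successes * math.log(p) + (n - successes) * math.log(1 - p)

def aic(max_loglik, k):
    """Akaike information criterion for a model with k free parameters."""
    return 2 * k - 2 * max_loglik

n, successes = 100, 62  # hypothetical data

# Model 1: fair coin, p fixed at 0.5, so k = 0.
aic_fair = aic(bernoulli_loglik(successes, n, 0.5), k=0)

# Model 2: free p, maximized at the sample proportion, so k = 1.
aic_free = aic(bernoulli_loglik(successes, n, successes / n), k=1)

print(aic_fair, aic_free)
```

The model with the lower AIC is preferred; for these made-up counts that is the model with the free probability, despite the penalty for its extra parameter.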

Another way of comparing two statistical models is through the notion of deficiency introduced by Lucien Le Cam.[8]

from Grokipedia
A statistical model is a mathematical framework that formalizes a set of assumptions about the probability distribution generating observed data, typically represented as a family of probability distributions on a sample space.[1] Formally, it specifies a collection of possible distributions $\Phi$ that could have produced a data sample $\xi$, viewed as a realization of an underlying random vector $X_i$, thereby restricting the infinite possibilities of data-generating processes to a manageable set for analysis.[1] In parameterized forms, the model includes a parameter space $\Theta$ and a mapping $P: \Theta \to \mathcal{P}(S)$ that assigns a distribution $P_\theta$ to each parameter value $\theta$, where $S$ is the sample space.[2]

Statistical models serve as essential tools for inference and decision-making across disciplines including natural sciences, engineering, economics, and social sciences, enabling the quantification of uncertainty and relationships in data.[1] They support key tasks such as parameter estimation to infer unknown values from data, hypothesis testing to evaluate assumptions about the underlying process, and prediction to forecast unobserved outcomes based on fitted distributions.[1] Central components include the design, which maps experimental units to covariates; the sample space, comprising all possible response outcomes; and the family of distributions, which must satisfy consistency conditions like marginalization over subsets of data.[2]

Models are categorized primarily into parametric types, which assume a specific distributional form (e.g., normal or Poisson) defined by a finite number of parameters, and nonparametric types, which impose minimal structure on the distribution to allow greater flexibility at the cost of requiring larger samples.[1] Effective use demands careful specification of assumptions (such as independence or homoscedasticity), which should be verified through diagnostics, alongside model selection criteria to balance fit and complexity, ensuring robust and interpretable results.[1]

Fundamentals

Introduction

A statistical model serves as a mathematical representation of real-world phenomena, incorporating elements of randomness and uncertainty to describe, predict, or explain observed patterns in data.[3][4] Unlike purely mathematical equations that assume fixed relationships, these models acknowledge variability inherent in natural processes, allowing for probabilistic outcomes rather than deterministic predictions.[5][6] The primary purpose of a statistical model is to quantify uncertainty surrounding observations, enabling inferences about broader populations based on limited samples and facilitating hypothesis testing to evaluate competing explanations.[7][8] For instance, consider the simple case of rolling two fair six-sided dice: the probability of both landing on 5 is $ \frac{1}{36} $, a basic probabilistic calculation that models chance and variability in random events. This analogy highlights how statistical models formalize such uncertainties to make reliable predictions beyond observed data. Statistical models play a crucial role in data analysis across various domains, aiding decision-making under incomplete information. In economics, they forecast market behaviors and assess policy impacts; in biology, they interpret genetic sequences and population dynamics; and in machine learning, they underpin algorithms for pattern recognition and forecasting.[9][3][8]

Historical development

The foundations of statistical modeling trace back to the 17th and 18th centuries, when early probability theory began addressing variability in data. Jacob Bernoulli's Ars Conjectandi (1713) introduced the law of large numbers, establishing that the average of independent observations converges to the expected value, providing a cornerstone for modeling random processes.[10] Building on this, Abraham de Moivre's 1733 approximation of the binomial distribution to the normal curve offered a practical tool for handling large-scale variability, influencing subsequent probabilistic frameworks.[11] In the 19th century, advancements shifted toward systematic estimation techniques. Carl Friedrich Gauss's Theoria Motus Corporum Coelestium (1809) formalized the method of least squares for fitting linear models to observational data, minimizing errors under a normal distribution assumption and enabling precise astronomical predictions.[12] Concurrently, Pierre-Simon Laplace developed precursors to Bayesian inference through inverse probability in works like his 1774 memoir and later expansions in 1781 and 1786, allowing updates to probabilities based on evidence and laying groundwork for modern inferential modeling.[13] The 20th century marked the maturation of parametric and testing frameworks. 
Ronald Fisher's 1922 paper "On the Mathematical Foundations of Theoretical Statistics" advanced parametric models via maximum likelihood estimation, providing a unified approach to parameter inference and model selection.[14] In the 1930s, Jerzy Neyman and Egon Pearson's collaboration, notably their 1933 paper in Philosophical Transactions of the Royal Society, introduced hypothesis testing with the Neyman-Pearson lemma, emphasizing power and error control for decision-making under uncertainty.[15] Post-World War II, non-parametric methods emerged to relax distributional assumptions, with Frank Wilcoxon's 1945 rank-sum test exemplifying distribution-free alternatives for comparing groups.[16] Computational advances in the modern era expanded model complexity. From the 1980s, Bayesian networks, pioneered by Judea Pearl's 1985 framework for evidential reasoning, integrated graphical structures with probabilistic inference for handling dependencies in complex systems.[17] The 1990s and 2000s saw deeper integrations with machine learning, such as support vector machines (1995) and random forests (2001), blending statistical rigor with algorithmic scalability for high-dimensional data.[18] In the 21st century, the rise of big data catalyzed a key shift from descriptive statistics (summarizing past observations) to predictive modeling, leveraging vast datasets and computational power for forecasting, as seen in applications of machine learning to clinical and economic predictions.[19]

Definition and Framework

Formal definition

A statistical model is formally defined as a pair $(S, \mathcal{P})$, where $S$ is the sample space consisting of all possible outcomes or observations from an experiment, and $\mathcal{P}$ is a family of probability measures defined on $S$.[20] This structure provides a mathematical framework for describing the uncertainty in data generation processes. More precisely, the sample space $S$ is equipped with a $\sigma$-algebra $\mathcal{F}$ of measurable events, forming a measurable space $(S, \mathcal{F})$, on which the probability measures in $\mathcal{P}$ are defined; these measures assign probabilities to subsets of events in $\mathcal{F}$.[20] In the parametric case, the family takes the form $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$, where $\Theta \subset \mathbb{R}^k$ is the parameter space indexing the distributions, and each $P_\theta$ is a probability measure on $(S, \mathcal{F})$.[20] For the model to allow unique inference about parameters, it must satisfy the identifiability condition: if $\theta \neq \theta' \in \Theta$, then $P_\theta \neq P_{\theta'}$.[21] Unlike a fixed probability model, which specifies a single probability distribution, a statistical model defines a class of distributions $\mathcal{P}$ from which the true generating mechanism is selected to fit observed data, enabling flexibility in statistical inference.[2] The likelihood function, central to parameter estimation within the model, is given in general form by the probability (or density) of the observed data under a specific parameter value:
$$ L(\theta \mid x) = P(X = x \mid \theta), $$
where $x \in S$ is the observed realization and $P(\cdot \mid \theta)$ denotes the measure $P_\theta$.[20]

Components and assumptions

A statistical model typically comprises several core components that formalize the probabilistic structure of the observed data. The primary elements include random variables representing the observations. In models that describe relationships between variables, such as regression models, these are often divided into response variables (often denoted as $ Y $, the outcome) and covariates (denoted as $ X $, the predictors). Parameters, denoted collectively as $ \theta $, quantify the model's characteristics and are treated as fixed unknowns in frequentist approaches or as random variables with prior distributions in Bayesian frameworks. In certain models with additive structure, error terms, often symbolized as $ \epsilon $, capture unexplained variation or noise, assuming they arise from an underlying probability distribution.[2][22] Central to the model's inferential framework are key assumptions that ensure the validity of statistical procedures. Observations are commonly assumed to be independent and identically distributed (i.i.d.), meaning each data point is drawn from the same probability distribution without influence from others. In many parametric models, such as linear regression, errors are further assumed to follow a normal distribution with mean zero and constant variance (homoscedasticity), alongside linearity or additivity in the relationship between predictors and the response. These assumptions underpin the model's ability to generalize from sample data to population inferences.[23][24] The likelihood function plays a pivotal role in linking these components to data for parameter estimation and hypothesis testing. For i.i.d. observations $ x_1, \dots, x_n $ from a distribution parameterized by $ \theta $, it is defined as
$$ L(\theta \mid x) = \prod_{i=1}^n P(x_i \mid \theta), $$
where $ P(x_i \mid \theta) $ is the probability mass or density function for each observation. This function quantifies the probability of the observed data given the parameters, serving as the foundation for maximum likelihood estimation and other inferential methods.[25][22] Violations of these assumptions can lead to serious inferential issues. For instance, dependence among observations or non-identical distributions may introduce bias in parameter estimates, while heteroscedasticity—unequal error variances—results in inefficient estimators and invalid standard errors, potentially leading to incorrect hypothesis tests and confidence intervals. Such failures undermine the model's reliability, emphasizing the need for robust checks.[26][24] To verify these assumptions, residual analysis is employed, where residuals (differences between observed and predicted values) are examined via plots such as residuals versus fitted values or normal probability plots. Deviations from randomness, such as patterns indicating non-linearity or heteroscedasticity, signal potential violations, guiding model refinement without delving into specific corrective techniques.[23][22]
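
The product form of the likelihood for i.i.d. observations can be sketched in a few lines of Python. The example uses a Bernoulli probability mass function as the per-observation term; the observations are hypothetical and the helper names are illustrative. In practice the log-likelihood (a sum of logs) is usually preferred, because a product of many small probabilities underflows.

```python
import math

def bernoulli_pmf(x, p):
    """P(x | p) for a single 0/1 observation."""
    return p if x == 1 else 1 - p

def likelihood(xs, theta, pmf):
    """L(theta | x) as a product over i.i.d. observations."""
    return math.prod(pmf(x, theta) for x in xs)

def log_likelihood(xs, theta, pmf):
    """The same quantity on the log scale, as a sum."""
    return sum(math.log(pmf(x, theta)) for x in xs)

xs = [1, 0, 1, 1, 0, 1]  # hypothetical observations

for p in (0.3, 0.5, 2 / 3):
    print(p, likelihood(xs, p, bernoulli_pmf), log_likelihood(xs, p, bernoulli_pmf))
```

Among the three candidate values, the likelihood is largest at $p = 2/3$, which is the proportion of ones in this sample.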

Examples and Illustrations

Basic examples

One of the simplest statistical models is the Bernoulli model, which describes outcomes of a single trial with two possible results, such as a coin flip where success (X=1) occurs with probability p and failure (X=0) with probability 1-p.[27] The probability mass function is given by:
$$ P(X = x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \end{cases} $$
[28]
The Poisson model is used for counting the number of rare events occurring in a fixed interval, such as customer arrivals per hour, where the average rate is λ.[29] The probability mass function is:
$$ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots $$
[30]
For continuous data, the normal distribution model assumes observations follow a Gaussian distribution, often applied to hypothetical height measurements with mean μ and variance σ².[31] The probability density function is:
$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) $$
[32]
Parameter estimation in these models can be approached intuitively through the method of moments, matching sample moments to population moments—for instance, the sample mean estimates μ in the normal model or λ in the Poisson model—or via maximum likelihood estimation, which maximizes the likelihood function derived from the assumed distribution under independent and identically distributed observations.[33][34] For the Bernoulli model, the maximum likelihood estimator of p is the sample proportion of successes.[35] These examples represent basic univariate parametric models, where the distribution form is specified up to a few parameters, providing an introductory framework for understanding probability distributions over a sample space.[36]
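
The estimators named above are simple enough to check numerically. The sketch below computes the sample proportion for Bernoulli data and the sample mean for Poisson counts, then confirms that the Poisson log-likelihood is no higher at nearby values of the rate. The data are hypothetical and the function names are illustrative.

```python
import math

def poisson_log_pmf(k, lam):
    """log P(X = k) for a Poisson distribution with rate lam > 0."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def poisson_loglik(counts, lam):
    """Log-likelihood of i.i.d. Poisson counts."""
    return sum(poisson_log_pmf(k, lam) for k in counts)

# Hypothetical data.
tosses = [1, 0, 1, 1, 0, 1, 1, 0]   # Bernoulli outcomes
counts = [3, 1, 4, 2, 0, 5, 2, 3]   # events per interval

p_hat = sum(tosses) / len(tosses)    # sample proportion: MLE of p
lam_hat = sum(counts) / len(counts)  # sample mean: MLE (and moment estimate) of lambda

print(p_hat, lam_hat)
for lam in (lam_hat - 0.1, lam_hat, lam_hat + 0.1):
    print(lam, poisson_loglik(counts, lam))
```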

Applied examples

Statistical models find extensive application in real-world scenarios where data-driven insights inform decision-making across diverse fields. In these contexts, models are fitted to observed data to uncover patterns, predict outcomes, and quantify uncertainties, often assuming parametric forms with specified error distributions to enable inference. For instance, linear regression serves as a foundational tool for modeling continuous relationships, such as the growth in children's height as a function of age.[37] A classic application of linear regression appears in pediatric studies, where researchers model a child's height $ Y $ against their age using the equation
$$ Y = \beta_0 + \beta_1 \cdot \text{age} + \epsilon, \quad \epsilon \sim N(0, \sigma^2), $$

with three parameters ($ \beta_0 $, $ \beta_1 $, and $ \sigma^2 $) estimated from longitudinal growth data. This model captures the linear trend in height increase during early childhood, allowing predictions of expected stature and identification of growth deviations for clinical intervention.[37] Such fitting extracts insights like average annual height gain, aiding in nutritional assessments and early detection of disorders.
In medical diagnostics, logistic regression addresses binary outcomes, such as the presence or absence of a disease, by modeling the log-odds of the probability $ P $ as
$$ \text{logit}(P) = \beta_0 + \beta_1 x, $$

where $ x $ represents a predictor like exposure level or biomarker value. This approach is widely used for classification tasks, estimating disease risk from patient covariates and enabling probabilistic predictions that guide screening protocols.[38] For example, in cardiovascular research, it quantifies the association between risk factors and event occurrence, supporting targeted prevention strategies.[39]
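
To show how the log-odds equation turns into a probability, here is a minimal sketch of the logit function and its inverse. The coefficient values are hypothetical and chosen only for illustration; in an actual analysis they would be estimated from data.

```python
import math

def logit(p):
    """Log-odds of a probability p with 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Inverse of logit: maps any real number to a probability."""
    return 1 / (1 + math.exp(-z))

def predicted_probability(x, beta0, beta1):
    """P such that logit(P) = beta0 + beta1 * x."""
    return inv_logit(beta0 + beta1 * x)

# Hypothetical coefficients: intercept -2.0, slope 0.8 per unit of the predictor.
for x in (0.0, 2.5, 5.0):
    print(x, predicted_probability(x, beta0=-2.0, beta1=0.8))
```

With these made-up coefficients the predicted probability is exactly 0.5 at $x = 2.5$, where the linear predictor crosses zero.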
Time series models, particularly the autoregressive model of order one (AR(1)), are applied to financial data like stock prices to account for temporal dependencies and autocorrelation. The model is specified as
$$ X_t = \phi X_{t-1} + \epsilon_t, $$

where $ X_t $ denotes the price at time $ t $, $ \phi $ measures persistence from the prior period, and $ \epsilon_t $ is white noise. Fitting this to historical stock returns reveals short-term momentum or mean-reversion patterns, informing trading algorithms and volatility forecasts.[40] Through parameter estimation, such models predict future price trajectories, helping investors assess market risks from observed fluctuations.[41]
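
The AR(1) recursion and its parameter estimate can be sketched as follows. The series is simulated rather than taken from market data, the noise is assumed Gaussian, and $\phi$ is estimated by least squares on consecutive pairs; the function names and the value $\phi = 0.7$ are assumptions made for the illustration.

```python
import random

def simulate_ar1(n, phi, sigma=1.0, seed=0):
    """Generate n values of X_t = phi * X_{t-1} + eps_t, starting from X_0 = 0."""
    rng = random.Random(seed)
    xs = [0.0]
    for _ in range(n - 1):
        xs.append(phi * xs[-1] + rng.gauss(0.0, sigma))
    return xs

def estimate_phi(xs):
    """Least-squares estimate of phi from regressing X_t on X_{t-1} (no intercept)."""
    numerator = sum(prev * cur for prev, cur in zip(xs, xs[1:]))
    denominator = sum(prev * prev for prev in xs[:-1])
    return numerator / denominator

series = simulate_ar1(n=5000, phi=0.7)
phi_hat = estimate_phi(series)
print(phi_hat)
```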
Beyond these, statistical models facilitate data fitting to derive actionable insights, such as trend projections in environmental monitoring or risk probabilities in insurance underwriting, by optimizing parameters to minimize discrepancies between predictions and data. In epidemiology, survival models like the Cox proportional hazards framework analyze time-to-event data, such as patient remission durations post-treatment, to evaluate intervention efficacy while accounting for censored (incompletely observed) cases.[42] In economics, regression-based or time series models forecast product demand by relating sales to variables like income and pricing, optimizing inventory and pricing decisions in supply chains.[43] These applications underscore the models' role in translating empirical patterns into predictive and explanatory power across disciplines.

Model Characteristics

Types of statistical models

Statistical models can be classified based on their parameterization and flexibility in representing the underlying data-generating process. This classification highlights how models balance assumptions about the form of the probability distribution with the ability to adapt to data without rigid constraints. Key categories include parametric, non-parametric, and semi-parametric models, each with distinct structural properties.[44][45] Parametric models assume a specific functional form for the probability distribution, indexed by a finite-dimensional parameter space $\Theta \subset \mathbb{R}^k$ for some fixed $k$. The model is defined as a family of distributions $\{P_\theta : \theta \in \Theta\}$, where the parameters fully specify the distribution shape. For instance, the normal distribution is parameterized by its mean $\mu$ and standard deviation $\sigma$, allowing efficient estimation and inference under the assumed form.[46][47][45] Non-parametric models, in contrast, do not impose a fixed functional form and operate in an infinite-dimensional parameter space, directly estimating the distribution from the data without assuming a specific family. These models, such as kernel density estimation, use flexible methods like smoothing over observations to approximate the underlying density, making them suitable for unknown or complex distributions.[44][45][48] Semi-parametric models combine elements of both, featuring a finite-dimensional parametric component alongside an unspecified infinite-dimensional part, providing a hybrid approach that relaxes some assumptions while retaining structure. A prominent example is the Cox proportional hazards model in survival analysis, which parameterizes the hazard ratio effects of covariates while leaving the baseline hazard function non-parametrically unspecified.[49][50][51] Beyond these core classes, other types address specific structural needs.
Hierarchical models incorporate multi-level parameters to account for nested data structures, such as varying intercepts across groups in multilevel regression, enabling the modeling of dependencies at different scales. Bayesian models integrate prior distributions on parameters $\theta$, updating beliefs via posterior inference to incorporate uncertainty and external knowledge into the modeling process. Graphical models represent dependencies among variables using graphs, such as directed acyclic graphs (DAGs) in Bayesian networks, to encode conditional independencies and facilitate efficient computation in multivariate settings.[52][53][54] These types involve trade-offs in performance: parametric models offer high statistical efficiency and simplicity when assumptions hold, as they concentrate estimation power in few parameters, but they can fail dramatically under misspecification. Non-parametric models provide robustness to distributional assumptions by adapting flexibly to data, though at the cost of lower efficiency and higher computational demands, especially in small samples. Semi-parametric and other variants aim to mitigate these by blending strengths, balancing bias and variance in practical applications.[44][45][55]
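
As an illustration of the non-parametric approach, the sketch below implements a basic kernel density estimate with a Gaussian kernel. The sample values and the hand-picked bandwidth are assumptions for the example; real applications usually select the bandwidth from the data.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density estimate that averages Gaussian bumps centered on the samples."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

samples = [1.0, 1.5, 2.2, 2.4, 4.0]  # hypothetical observations
density = gaussian_kde(samples, bandwidth=0.5)

for x in (0.0, 2.0, 4.0):
    print(x, density(x))
```

No distributional family is assumed here: the estimate is determined by the observations themselves, so its effective number of parameters grows with the sample.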

Dimension and complexity

In statistical modeling, the dimension of a model quantifies its complexity and flexibility, primarily through the number of free parameters kk in parametric models, where the parameter space Θ\Theta is a finite-dimensional subset of Rk\mathbb{R}^k. This finite dimensionality allows for tractable estimation and inference under regularity conditions, as the model assumes the data-generating process belongs to a restricted family of distributions indexed by these kk parameters.[46] In contrast, non-parametric models possess an infinite-dimensional parameter space, enabling them to approximate arbitrary distributions without assuming a fixed form, though this comes at the cost of requiring larger sample sizes for reliable estimation.[56] Higher model dimension facilitates capturing intricate patterns in the data by reducing bias, but it simultaneously amplifies variance in parameter estimates and heightens the risk of overfitting, especially when the sample size nn is small compared to kk, leading to poor generalization to new data.[57] For instance, in a linear regression model with pp predictors, the dimension is p+1p + 1, encompassing the intercept and one coefficient per predictor:
$$ \dim = p + 1 $$
This structure underscores how each additional parameter expands the model's expressive power while demanding more data to stabilize estimates.[58]

To address limitations of raw parameter count, the concept of effective dimension provides a refined measure of model complexity. In analysis of variance (ANOVA), degrees of freedom serve as an effective dimension, representing the number of independent values free to vary after accounting for constraints imposed by the model, such as the number of groups minus one for between-group variation.[59] In machine learning contexts, the Vapnik-Chervonenkis (VC) dimension measures the capacity of a hypothesis class as the size of the largest set of points that the class can shatter, that is, label in every possible way using functions in the class. It yields a combinatorial bound on overfitting risk that does not depend on the actual data distribution.[60]

High dimensionality exacerbates the curse of dimensionality, where exponential growth in space volume results in data sparsity, complicating accurate estimation and inference by diluting local density and inflating the search space for optimal parameters. This phenomenon necessitates techniques like dimensionality reduction or regularization to mitigate challenges in high-dimensional regimes, where traditional asymptotic assumptions fail and non-asymptotic analyses become essential.[61]
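The link between dimension and fit described in this section can be illustrated with polynomial regression, where a polynomial of degree $ d $ has dimension $ d + 1 $. The sketch below is an assumption-laden illustration rather than a reference implementation: the simulated data, the helper names (`solve`, `fit_polynomial`, `rss`, `simulate`) and the noise level are all hypothetical. It fits nested polynomials by least squares and reports the training residual sum of squares next to the error on fresh data.

```python
import random


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares fit via the normal equations; returns degree + 1 coefficients."""
    k = degree + 1  # model dimension: intercept plus one coefficient per power
    X = [[x ** j for j in range(k)] for x in xs]
    XtX = [[sum(row[a] * row[c] for row in X) for c in range(k)] for a in range(k)]
    Xty = [sum(row[a] * y for row, y in zip(X, ys)) for a in range(k)]
    return solve(XtX, Xty)


def rss(coefs, xs, ys):
    """Residual sum of squares of the fitted polynomial on the points (xs, ys)."""
    return sum(
        (y - sum(c * x ** j for j, c in enumerate(coefs))) ** 2
        for x, y in zip(xs, ys)
    )


def simulate(rng, n):
    """Hypothetical data: a straight line plus Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = simulate(rng, 15)
test_x, test_y = simulate(rng, 200)

results = []  # (dimension, training RSS, mean squared error on fresh data)
for degree in range(6):
    coefs = fit_polynomial(train_x, train_y, degree)
    results.append(
        (len(coefs), rss(coefs, train_x, train_y), rss(coefs, test_x, test_y) / len(test_x))
    )

for dim, train_rss, test_mse in results:
    print(f"dim={dim}  train RSS={train_rss:.4f}  test MSE={test_mse:.4f}")
```

Because the models are nested, the training RSS cannot increase as the dimension grows. The error on fresh data carries no such guarantee, which is why in-sample fit alone is a poor guide to the right dimension.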

Nested models

In statistics, nested models refer to a hierarchical relationship where one model, denoted as $ M_1 $, is a special case of a more general model $ M_2 $, such that the parameter space of $ M_1 $ is a proper subset of the parameter space of $ M_2 $.[62] For instance, a linear regression model is nested within a quadratic regression model, as the former can be obtained by constraining the coefficient of the quadratic term in the latter to zero.[22] This nesting structure implies that $ M_1 $ has fewer parameters or imposes additional restrictions on the parameters of $ M_2 $, allowing for direct comparisons of model adequacy through the difference in their complexities.[62]

A primary method for testing nested models is the likelihood ratio test (LRT), which assesses whether the additional parameters in $ M_2 $ significantly improve the fit over $ M_1 $. The test statistic is given by
$$ \Lambda = -2 \log \left( \frac{L_{M_1}}{L_{M_2}} \right), $$
where $ L_{M_1} $ and $ L_{M_2} $ are the maximized likelihoods under $ M_1 $ and $ M_2 $, respectively. Under the null hypothesis that $ M_1 $ is adequate (i.e., the extra parameters in $ M_2 $ are zero), $ \Lambda $ asymptotically follows a chi-squared distribution with degrees of freedom equal to the difference in the dimensions of the parameter spaces, $ \dim(M_2) - \dim(M_1) $.[63][22] This asymptotic result, known as Wilks' theorem, holds under regularity conditions such as the identifiability of parameters and the existence of finite moments.[63]

The LRT for nested models is used to test the significance of added parameters, for example testing whether a specific coefficient satisfies $ \beta_1 = 0 $ in a regression context by comparing the full model against the reduced model excluding that term.[22] This approach facilitates sequential model building, where simpler models are expanded incrementally while assessing the statistical significance of each addition through p-values derived from the chi-squared distribution.[22] One key advantage is its ability to quantify the trade-off between model fit and parsimony in a hypothesis-testing framework, enabling researchers to justify model expansions based on empirical evidence.[22]

However, the validity of the LRT relies on the correct specification of the larger model $ M_2 $; if $ M_2 $ is misspecified, the asymptotic chi-squared distribution may not hold, leading to invalid inference.[63] Additionally, the test assumes that the nesting occurs away from the boundary of the parameter space to ensure the regularity conditions for Wilks' theorem are met.[62]
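The test of $ \beta_1 = 0 $ described above can be sketched for the simplest nested pair, an intercept-only model inside a simple linear regression with Gaussian errors. This is an illustrative sketch under stated assumptions: the data are simulated, the helper names are hypothetical, and the p-value formula is specific to one extra parameter, where the chi-squared tail with 1 degree of freedom equals `erfc(sqrt(stat / 2))`.

```python
import math
import random


def rss_intercept_only(ys):
    """Residual sum of squares of the reduced model y = b0 + error."""
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)


def rss_simple_linear(xs, ys):
    """Residual sum of squares of the full model y = b0 + b1 * x + error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))


def gaussian_max_loglik(rss, n):
    """Maximized Gaussian log-likelihood, with the variance at its MLE rss / n."""
    return -0.5 * n * (math.log(2.0 * math.pi * rss / n) + 1.0)


def likelihood_ratio_test(loglik_small, loglik_big):
    """LRT statistic and asymptotic p-value for ONE extra parameter (1 df)."""
    stat = -2.0 * (loglik_small - loglik_big)
    # For 1 degree of freedom, P(chi2 > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value


# Hypothetical data with a genuine linear trend.
rng = random.Random(1)
xs = [i / 10.0 for i in range(40)]
ys = [0.5 + 1.5 * x + rng.gauss(0.0, 1.0) for x in xs]
n = len(xs)

rss_small = rss_intercept_only(ys)
rss_big = rss_simple_linear(xs, ys)
stat, p_value = likelihood_ratio_test(
    gaussian_max_loglik(rss_small, n), gaussian_max_loglik(rss_big, n)
)
print(f"LRT statistic = {stat:.2f}, asymptotic p-value = {p_value:.3g}")
```

For Gaussian regression the statistic reduces to $ n \log(\mathrm{RSS}_1 / \mathrm{RSS}_2) $, so it is never negative when the models are nested. With more than one extra parameter, the chi-squared tail probability would need a general routine rather than the 1-df shortcut used here.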

Evaluation and Comparison

Criteria for model comparison

Model comparison in statistics involves evaluating competing models using quantitative criteria that balance goodness of fit to the observed data with penalties for model complexity to avoid overfitting. These criteria help assess how well a model explains the data while considering its simplicity and potential for generalization.

A primary goodness-of-fit measure for linear regression models is the coefficient of determination, denoted $ R^2 $, which quantifies the proportion of variance in the dependent variable explained by the model.[64] It is calculated as
$$ R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}, $$
where $ SS_{\text{res}} $ is the residual sum of squares and $ SS_{\text{tot}} $ is the total sum of squares.[64] Higher values of $ R^2 $ indicate better fit, but $ R^2 $ does not account for model complexity: in least-squares regression it never decreases when predictors are added, so it cannot by itself guard against overfitting.[64]

Information criteria provide a unified framework for comparing models by combining likelihood-based fit with a penalty for the number of parameters, denoted $ k $. The Akaike Information Criterion (AIC) is defined as
$$ \text{AIC} = -2 \log L + 2k, $$
where $ L $ is the maximized likelihood of the model. Lower AIC values indicate better models, as the criterion penalizes complexity to favor those with superior predictive accuracy. The Bayesian Information Criterion (BIC) extends this approach with a stronger penalty term involving sample size $ n $, given by $ \text{BIC} = -2 \log L + k \log n $.

In Bayesian model comparison, the Bayes factor offers a direct measure of relative evidence between two models, $ M_1 $ and $ M_2 $, defined as
$$ BF_{12} = \frac{P(\text{data} \mid M_1)}{P(\text{data} \mid M_2)}, $$
where $ P(\text{data} \mid M_i) $ is the marginal likelihood under model $ M_i $.[65] Values greater than 1 favor $ M_1 $, with scales interpreting strengths such as "strong evidence" for $ BF_{12} > 10 $.[65]

Additional metrics include the deviance, $ D = -2 \log L $ (up to an additive constant determined by the saturated model), which serves as a goodness-of-fit statistic in generalized linear models analogous to the residual sum of squares in linear regression. Lower deviance indicates better fit. For assessing out-of-sample predictive performance, cross-validation error estimates the expected prediction error by partitioning the data and evaluating model performance on held-out subsets.[66]

AIC is particularly suited for model selection aimed at prediction, as its penalty promotes models that minimize expected prediction error.[67] In contrast, BIC is preferred when the goal is to identify the true underlying model, assuming it is among the candidates: its stronger complexity penalty makes the selection consistent as the sample size grows.[67]
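The formulas for $ R^2 $, AIC and BIC given above translate directly into code. The sketch below uses hypothetical log-likelihoods and parameter counts for two candidate models, chosen only to show that the two information criteria can disagree; none of the numbers come from real data.

```python
import math


def r_squared(ss_res, ss_tot):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    return 1.0 - ss_res / ss_tot


def aic(loglik, k):
    """Akaike Information Criterion: -2 log L + 2k."""
    return -2.0 * loglik + 2.0 * k


def bic(loglik, k, n):
    """Bayesian Information Criterion: -2 log L + k log n."""
    return -2.0 * loglik + k * math.log(n)


# Hypothetical candidates fitted to the same n = 100 observations.
n = 100
small = {"loglik": -120.0, "k": 3}  # simpler model, slightly worse fit
large = {"loglik": -117.0, "k": 5}  # two more parameters, better fit

for name, m in (("small", small), ("large", large)):
    print(
        f"{name}: AIC={aic(m['loglik'], m['k']):.2f}  "
        f"BIC={bic(m['loglik'], m['k'], n):.2f}"
    )

# Hypothetical sums of squares: 25 residual out of 100 total.
print("R^2 =", r_squared(25.0, 100.0))
```

With these numbers AIC prefers the larger model (244 against 246), while BIC prefers the smaller one, because its per-parameter penalty $ \log 100 \approx 4.6 $ exceeds the AIC penalty of 2.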

Model selection methods

Model selection methods encompass a variety of algorithmic procedures designed to identify the most appropriate statistical model from a set of candidates, balancing goodness-of-fit with generalization to unseen data. These techniques integrate information criteria, such as AIC, with iterative processes to navigate the space of possible models, particularly in scenarios involving multiple predictors or high dimensionality. Unlike purely evaluative criteria, these methods emphasize practical workflows for implementation, often incorporating thresholds or optimization steps to automate the selection process.

Stepwise selection is an automated forward-backward procedure for building regression models by sequentially adding or removing predictor variables based on statistical significance or information criteria. In forward selection, the procedure starts from an intercept-only model and adds variables one at a time if they significantly improve the model, typically judged by a p-value below a threshold (e.g., 0.05) or a reduction in AIC exceeding a specified value. Backward elimination begins with all variables and iteratively removes the least significant one until no remaining variable meets the removal criterion (for example, a p-value above 0.10, or a removal that would lower AIC). Bidirectional stepwise combines both, alternating additions and removals until convergence, as originally formalized for multiple regression analysis. This approach is computationally efficient for moderate numbers of variables but can lead to unstable selections due to its greedy nature.

Cross-validation provides a resampling-based method to estimate a model's predictive performance by partitioning the data into subsets, training on some and validating on others, thereby simulating out-of-sample evaluation without requiring a separate holdout set.
In k-fold cross-validation, the dataset is divided into k equally sized folds; the model is trained k times, each time using k-1 folds for training and the remaining fold for testing, with the average performance metric (e.g., mean squared error) serving as the selection criterion. Leave-one-out cross-validation is the special case k=n, where n is the sample size: each observation is held out in turn. It is often used for small datasets and offers a nearly unbiased estimate of prediction error, though at higher computational cost. Cross-validation is particularly valuable for tuning hyperparameters or comparing models, as it mitigates the optimism bias inherent in in-sample metrics.[68]

Regularization methods embed model selection within the estimation process by penalizing model complexity, promoting sparsity and reducing overfitting in high-dimensional settings where the number of predictors may exceed the number of observations. The Lasso (Least Absolute Shrinkage and Selection Operator) achieves this through L1 penalization, solving the optimization problem:
$$ \arg\min_{\beta} \| y - X \beta \|_2^2 + \lambda \| \beta \|_1 $$
where $ \lambda > 0 $ controls the penalty strength, driving less important coefficients to exactly zero for automatic variable selection while shrinking others toward zero. This makes Lasso suitable for sparse models, as demonstrated in linear regression applications, and $ \lambda $ can be tuned via cross-validation. Unlike ridge regression (L2 penalty), Lasso performs explicit selection, enhancing interpretability in feature-rich data.[69]

Bayesian model averaging (BMA) addresses model uncertainty by assigning posterior probabilities to each candidate model and computing predictions as a weighted average, rather than committing to a single selected model. Under a Bayesian framework, the posterior model probability is proportional to the prior times the marginal likelihood, with weights reflecting evidential support from the data. Predictions are then formed as $ \mathbb{E}[y^* \mid \text{data}] = \sum_m P(m \mid \text{data}) \, \mathbb{E}[y^* \mid m, \text{data}] $, where $ m $ indexes models, integrating over the posterior distribution to hedge against selection errors. BMA is especially effective in scenarios with many similar models, providing more robust inference than point selection, as implemented in linear regression via Markov chain Monte Carlo for exploring the model space.[70]

Best practices in model selection emphasize rigorous validation to guard against overfitting and data dredging, where excessive searching inflates apparent significance. Evaluate the finally selected model on a held-out test set that played no part in the tuning process, so that the reported performance estimate is unbiased, and prefer cross-validation over in-sample criteria for hyperparameter choice. Avoid over-reliance on stepwise methods in large variable spaces due to their susceptibility to multiple testing issues; instead, combine regularization with ensemble techniques for stability. Document the entire selection pipeline to facilitate reproducibility and assess sensitivity to criteria thresholds.
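The k-fold cross-validation procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the data are simulated, the two candidate models (a constant mean and a straight line) and all helper names are hypothetical, and the error is averaged over all held-out observations, which equals the average of the per-fold errors when the folds have equal size.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k folds whose sizes differ by at most one."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_mean(xs, ys):
    """Intercept-only model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Simple linear regression fitted by least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def cross_val_mse(fit, xs, ys, k=5, seed=0):
    """Mean squared prediction error over all held-out observations."""
    total_sq_err = 0.0
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        total_sq_err += sum((ys[i] - predict(xs[i])) ** 2 for i in held_out)
    return total_sq_err / len(xs)


# Hypothetical data with a linear trend and unit-variance noise.
rng = random.Random(42)
xs = [rng.uniform(0.0, 10.0) for _ in range(50)]
ys = [2.0 + 0.8 * x + rng.gauss(0.0, 1.0) for x in xs]

mse_mean = cross_val_mse(fit_mean, xs, ys)
mse_line = cross_val_mse(fit_line, xs, ys)
print(f"5-fold CV MSE, intercept only: {mse_mean:.3f}")
print(f"5-fold CV MSE, straight line:  {mse_line:.3f}")
```

Both candidates are scored on the same folds (same `seed`), so the comparison is not affected by differences in how the data were partitioned. Setting `k` equal to the number of observations gives leave-one-out cross-validation.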

References
