Statistical model
A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, the data-generating process.[1] When referring specifically to probabilities, the corresponding term is probabilistic model. All statistical hypothesis tests and all statistical estimators are derived via statistical models. More generally, statistical models are part of the foundation of statistical inference. A statistical model is usually specified as a mathematical relationship between one or more random variables and other non-random variables. As such, a statistical model is "a formal representation of a theory" (Herman Adèr quoting Kenneth Bollen).[2]
Introduction
Informally, a statistical model can be thought of as a statistical assumption (or set of statistical assumptions) with a certain property: that the assumption allows us to calculate the probability of any event. As an example, consider a pair of ordinary six-sided dice. We will study two different statistical assumptions about the dice.
The first statistical assumption is this: for each of the dice, the probability of each face (1, 2, 3, 4, 5, and 6) coming up is 1/6. From that assumption, we can calculate the probability of both dice coming up 5: 1/6 × 1/6 = 1/36. More generally, we can calculate the probability of any event: e.g. (1 and 2) or (3 and 3) or (5 and 6). The alternative statistical assumption is this: for each of the dice, the probability of the face 5 coming up is 1/8 (because the dice are weighted). From that assumption, we can calculate the probability of both dice coming up 5: 1/8 × 1/8 = 1/64. We cannot, however, calculate the probability of any other nontrivial event, as the probabilities of the other faces are unknown.
The first statistical assumption constitutes a statistical model: because with the assumption alone, we can calculate the probability of any event. The alternative statistical assumption does not constitute a statistical model: because with the assumption alone, we cannot calculate the probability of every event. In the example above, with the first assumption, calculating the probability of an event is easy. With some other examples, though, the calculation can be difficult, or even impractical (e.g. it might require millions of years of computation). For an assumption to constitute a statistical model, such difficulty is acceptable: doing the calculation does not need to be practicable, just theoretically possible.
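The fair-dice model above can be sketched in a few lines of code: under the assumption that every face has probability 1/6, every ordered pair of faces has probability 1/36, so the probability of any event is computable by enumeration. The event predicates below are arbitrary illustrations.

```python
from fractions import Fraction
from itertools import product

# Fair-dice statistical model: each face of each die has probability 1/6,
# so every ordered pair of faces has probability 1/36.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a predicate on (die1, die2) pairs."""
    return sum(Fraction(1, 36) for o in outcomes if event(o))

# Probability that both dice come up 5.
both_fives = prob(lambda o: o == (5, 5))         # 1/36
# The model also assigns a probability to every other event,
# e.g. the event that the faces sum to 7.
sum_is_seven = prob(lambda o: o[0] + o[1] == 7)  # 6/36 = 1/6
```

Under the alternative (weighted-dice) assumption, only the face-5 probability is specified, so no such exhaustive enumeration is possible: that is precisely why it fails to be a statistical model.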
Formal definition
In mathematical terms, a statistical model is a pair $(S, \mathcal{P})$, where $S$ is the set of possible observations, i.e. the sample space, and $\mathcal{P}$ is a set of probability distributions on $S$.[3] The set $\mathcal{P}$ represents all of the models that are considered possible. This set is typically parameterized: $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. The set $\Theta$ defines the parameters of the model. If a parameterization is such that distinct parameter values give rise to distinct distributions, i.e. $\theta_1 \neq \theta_2 \Rightarrow P_{\theta_1} \neq P_{\theta_2}$ (in other words, the mapping is injective), it is said to be identifiable.[3]
In some cases, the model can be more complex.
- In Bayesian statistics, the model is extended by adding a probability distribution over the parameter space $\Theta$.
- A statistical model can sometimes distinguish two sets of probability distributions. The first set $\mathcal{P}$ is the set of models considered for inference. The second set is the set of models that could have generated the data, which is much larger than $\mathcal{P}$. Such statistical models are key in checking that a given procedure is robust, i.e. that it does not produce catastrophic errors when its assumptions about the data are incorrect.
An example
Suppose that we have a population of children, with the ages of the children distributed uniformly in the population. The height of a child will be stochastically related to the age: e.g. when we know that a child is of age 7, this influences the chance of the child being 1.5 meters tall. We could formalize that relationship in a linear regression model, like this: $\text{height}_i = b_0 + b_1 \, \text{age}_i + \varepsilon_i$, where $b_0$ is the intercept, $b_1$ is a parameter that age is multiplied by to obtain a prediction of height, $\varepsilon_i$ is the error term, and $i$ identifies the child. This implies that height is predicted by age, with some error.
An admissible model must be consistent with all the data points. Thus, a straight line ($\text{height}_i = b_0 + b_1 \, \text{age}_i$) cannot be admissible for a model of the data, unless it exactly fits all the data points, i.e. all the data points lie perfectly on the line. The error term, $\varepsilon_i$, must be included in the equation, so that the model is consistent with all the data points. To do statistical inference, we would first need to assume some probability distributions for the $\varepsilon_i$. For instance, we might assume that the $\varepsilon_i$ distributions are i.i.d. Gaussian, with zero mean. In this instance, the model would have 3 parameters: $b_0$, $b_1$, and the variance of the Gaussian distribution. We can formally specify the model in the form $(S, \mathcal{P})$ as follows. The sample space, $S$, of our model comprises the set of all possible pairs (age, height). Each possible value of $\theta = (b_0, b_1, \sigma^2)$ determines a distribution on $S$; denote that distribution by $P_\theta$. If $\Theta$ is the set of all possible values of $\theta$, then $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. (The parameterization is identifiable, and this is easy to check.)
In this example, the model is determined by (1) specifying $S$ and (2) making some assumptions relevant to $\mathcal{P}$. There are two assumptions: that height can be approximated by a linear function of age; that errors in the approximation are distributed as i.i.d. Gaussian. The assumptions are sufficient to specify $\mathcal{P}$, as they are required to do.
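As a sketch of how this model is fitted in practice, the following simulates data from the height-age model with assumed parameter values (the numbers are purely illustrative, not real growth data) and recovers the three parameters by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data under height_i = b0 + b1 * age_i + eps_i with i.i.d.
# Gaussian errors; the true values below are illustrative assumptions.
b0_true, b1_true, sigma = 75.0, 6.0, 4.0
age = rng.uniform(2, 12, size=200)
height = b0_true + b1_true * age + rng.normal(0.0, sigma, size=200)

# Least-squares estimates of the three parameters (b0, b1, sigma^2).
X = np.column_stack([np.ones_like(age), age])
(b0_hat, b1_hat), residuals, *_ = np.linalg.lstsq(X, height, rcond=None)
sigma2_hat = residuals[0] / len(age)  # ML estimate of the error variance
```

With 200 simulated children, the estimates land close to the assumed values, illustrating how the parameterization $(b_0, b_1, \sigma^2)$ is recovered from observed (age, height) pairs.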
General remarks
A statistical model is a special class of mathematical model. What distinguishes a statistical model from other mathematical models is that a statistical model is non-deterministic. Thus, in a statistical model specified via mathematical equations, some of the variables do not have specific values, but instead have probability distributions; i.e. some of the variables are stochastic. In the above example with children's heights, ε is a stochastic variable; without that stochastic variable, the model would be deterministic. Statistical models are often used even when the data-generating process being modeled is deterministic. For instance, coin tossing is, in principle, a deterministic process; yet it is commonly modeled as stochastic (via a Bernoulli process). Choosing an appropriate statistical model to represent a given data-generating process is sometimes extremely difficult, and may require knowledge of both the process and relevant statistical analyses. Relatedly, the statistician Sir David Cox has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis".[4]
There are three purposes for a statistical model, according to Konishi & Kitagawa:[5]
- Predictions
- Extraction of information
- Description of stochastic structures
Those three purposes are essentially the same as the three purposes indicated by Friendly & Meyer: prediction, estimation, description.[6]
Dimension of a model
Suppose that we have a statistical model $(S, \mathcal{P})$ with $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. In notation, we write that $\Theta \subseteq \mathbb{R}^k$ where $k$ is a positive integer ($\mathbb{R}$ denotes the real numbers; other sets can be used, in principle). Here, $k$ is called the dimension of the model. The model is said to be parametric if $\Theta$ has finite dimension.[citation needed] As an example, if we assume that data arise from a univariate Gaussian distribution, then we are assuming that
- $\mathcal{P} = \{\mathcal{N}(\mu, \sigma^2) : \mu \in \mathbb{R}, \sigma > 0\}$.
In this example, the dimension, k, equals 2. As another example, suppose that the data consists of points (x, y) that we assume are distributed according to a straight line with i.i.d. Gaussian residuals (with zero mean): this leads to the same statistical model as was used in the example with children's heights. The dimension of the statistical model is 3: the intercept of the line, the slope of the line, and the variance of the distribution of the residuals. (Note the set of all possible lines has dimension 2, even though geometrically, a line has dimension 1.)
Although formally $\theta$ is a single parameter that has dimension $k$, it is sometimes regarded as comprising $k$ separate parameters. For example, with the univariate Gaussian distribution, $\theta$ is formally a single parameter with dimension 2, but it is often regarded as comprising 2 separate parameters: the mean and the standard deviation. A statistical model is nonparametric if the parameter set $\Theta$ is infinite dimensional. A statistical model is semiparametric if it has both finite-dimensional and infinite-dimensional parameters. Formally, if $k$ is the dimension of $\Theta$ and $n$ is the number of samples, both semiparametric and nonparametric models have $k \rightarrow \infty$ as $n \rightarrow \infty$. If $k/n \rightarrow 0$ as $n \rightarrow \infty$, then the model is semiparametric; otherwise, the model is nonparametric.
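The univariate Gaussian case can be made concrete with a short sketch: the single parameter $\theta = (\mu, \sigma)$ of dimension 2 is estimated from data by maximum likelihood (the data values below are made up for illustration).

```python
import math

# The univariate Gaussian family, parameterized by a single point
# theta = (mu, sigma) in a 2-dimensional parameter space.
def log_likelihood(theta, data):
    mu, sigma = theta  # one parameter of dimension k = 2
    n = len(data)
    return (-n / 2) * math.log(2 * math.pi * sigma**2) \
           - sum((x - mu) ** 2 for x in data) / (2 * sigma**2)

data = [4.8, 5.1, 5.3, 4.9, 5.2]  # illustrative observations

# Closed-form maximum-likelihood estimates of the two components.
mu_hat = sum(data) / len(data)
var_hat = sum((x - mu_hat) ** 2 for x in data) / len(data)
theta_hat = (mu_hat, math.sqrt(var_hat))  # the ML estimate, dimension 2
```

Whether $\theta$ is read as one 2-dimensional parameter or as two scalar parameters, the fitted model is the same.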
Parametric models are by far the most commonly used statistical models. Regarding semiparametric and nonparametric models, Sir David Cox has said, "These typically involve fewer assumptions of structure and distributional form but usually contain strong assumptions about independencies".[7]
Nested models
Two statistical models are nested if the first model can be transformed into the second model by imposing constraints on the parameters of the first model. As an example, the set of all Gaussian distributions has, nested within it, the set of zero-mean Gaussian distributions: we constrain the mean in the set of all Gaussian distributions to get the zero-mean distributions. As a second example, the quadratic model
- $y = b_0 + b_1 x + b_2 x^2 + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$
has, nested within it, the linear model
- $y = b_0 + b_1 x + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$
Here we constrain the parameter $b_2$ to equal 0.
In both those examples, the first model has a higher dimension than the second model (for the first example, the zero-mean model has dimension 1). Such is often, but not always, the case. As an example where they have the same dimension, the set of positive-mean Gaussian distributions is nested within the set of all Gaussian distributions; they both have dimension 2.
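A quick numerical check of nesting (on synthetic data with arbitrary illustrative values): because the linear model is the quadratic model with $b_2$ constrained to 0, the larger quadratic model can never fit the data worse than the nested linear model.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0.0, 1.0, size=50)  # data from the linear model

# Fit the larger (quadratic) model and the nested (linear) model.
quad = np.polyfit(x, y, deg=2)  # y = b2*x^2 + b1*x + b0
lin = np.polyfit(x, y, deg=1)   # the nested model: b2 constrained to 0

rss_quad = np.sum((np.polyval(quad, x) - y) ** 2)
rss_lin = np.sum((np.polyval(lin, x) - y) ** 2)
# The larger model always fits at least as well: rss_quad <= rss_lin.
```

The extra fit bought by the unconstrained $b_2$ is typically tiny when the linear model is true, which is what model-comparison criteria penalize.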
Comparing models
Comparing statistical models is fundamental for much of statistical inference. Konishi & Kitagawa (2008, p. 75) state: "The majority of the problems in statistical inference can be considered to be problems related to statistical modeling. They are typically formulated as comparisons of several statistical models." Common criteria for comparing models include the following: R2, Bayes factor, Akaike information criterion, and the likelihood-ratio test together with its generalization, the relative likelihood.
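As a sketch of one such criterion, the Akaike information criterion trades goodness of fit against the number of parameters; for a least-squares fit with Gaussian errors, the maximized log-likelihood depends only on the residual sum of squares. The RSS values below are made-up numbers for illustration.

```python
import math

def gaussian_loglik(rss, n):
    """Maximized Gaussian log-likelihood of a least-squares fit
    with residual sum of squares rss over n observations."""
    return -(n / 2) * (math.log(2 * math.pi * rss / n) + 1)

def aic(loglik, k):
    """Akaike information criterion: 2k - 2 log L; lower is better."""
    return 2 * k - 2 * loglik

# Hypothetical fits of a linear (k = 3 parameters) and a quadratic (k = 4)
# model to the same n = 50 points: a slightly smaller RSS for the larger
# model need not win once the extra parameter is penalized.
n = 50
aic_linear = aic(gaussian_loglik(52.0, n), k=3)
aic_quadratic = aic(gaussian_loglik(51.5, n), k=4)
```

With these illustrative numbers the linear model attains the lower (better) AIC despite its slightly larger RSS.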
Another way of comparing two statistical models is through the notion of deficiency introduced by Lucien Le Cam.[8]
See also
- All models are wrong
- Blockmodel
- Conceptual model
- Design of experiments
- Deterministic model
- Effective theory
- Predictive model
- Response modeling methodology
- SackSEER
- Scientific model
- Statistical inference
- Statistical model specification
- Statistical model validation
- Statistical theory
- Stochastic process
Notes
- ^ Cox 2006, p. 178
- ^ Adèr 2008, p. 280
- ^ a b McCullagh 2002
- ^ Cox 2006, p. 197
- ^ Konishi & Kitagawa 2008, §1.1
- ^ Friendly & Meyer 2016, §11.6
- ^ Cox 2006, p. 2
- ^ Le Cam, Lucien (1964). "Sufficiency and Approximate Sufficiency". Annals of Mathematical Statistics. 35 (4). Institute of Mathematical Statistics: 1429. doi:10.1214/aoms/1177700372.
References
- Adèr, H. J. (2008), "Modelling", in Adèr, H. J.; Mellenbergh, G. J. (eds.), Advising on Research Methods: A consultant's companion, Huizen, The Netherlands: Johannes van Kessel Publishing, pp. 271–304.
- Burnham, K. P.; Anderson, D. R. (2002), Model Selection and Multimodel Inference (2nd ed.), Springer-Verlag.
- Cox, D. R. (2006), Principles of Statistical Inference, Cambridge University Press, doi:10.1017/CBO9780511813559.
- Friendly, M.; Meyer, D. (2016), Discrete Data Analysis with R, Chapman & Hall.
- Konishi, S.; Kitagawa, G. (2008), Information Criteria and Statistical Modeling, Springer.
- McCullagh, P. (2002), "What is a statistical model?" (PDF), Annals of Statistics, 30 (5): 1225–1310, doi:10.1214/aos/1035844977.
Further reading
- Davison, A. C. (2008), Statistical Models, Cambridge University Press
- Drton, M.; Sullivant, S. (2007), "Algebraic statistical models" (PDF), Statistica Sinica, 17: 1273–1297
- Freedman, D. A. (2009), Statistical Models, Cambridge University Press
- Helland, I. S. (2010), Steps Towards a Unified Basis for Scientific Models and Methods, World Scientific
- Kroese, D. P.; Chan, J. C. C. (2014), Statistical Modeling and Computation, Springer
- Shmueli, G. (2010), "To explain or to predict?", Statistical Science, 25 (3): 289–310, arXiv:1101.0891, doi:10.1214/10-STS330, S2CID 15900983
Statistical model
Fundamentals
Introduction
A statistical model serves as a mathematical representation of real-world phenomena, incorporating elements of randomness and uncertainty to describe, predict, or explain observed patterns in data.[3][4] Unlike purely mathematical equations that assume fixed relationships, these models acknowledge variability inherent in natural processes, allowing for probabilistic outcomes rather than deterministic predictions.[5][6] The primary purpose of a statistical model is to quantify uncertainty surrounding observations, enabling inferences about broader populations based on limited samples and facilitating hypothesis testing to evaluate competing explanations.[7][8] For instance, consider the simple case of rolling two fair six-sided dice: the probability of both landing on 5 is $ \frac{1}{36} $, a basic probabilistic calculation that models chance and variability in random events. This analogy highlights how statistical models formalize such uncertainties to make reliable predictions beyond observed data. Statistical models play a crucial role in data analysis across various domains, aiding decision-making under incomplete information. In economics, they forecast market behaviors and assess policy impacts; in biology, they interpret genetic sequences and population dynamics; and in machine learning, they underpin algorithms for pattern recognition and forecasting.[9][3][8]
Historical development
The foundations of statistical modeling trace back to the 17th and 18th centuries, when early probability theory began addressing variability in data. Jacob Bernoulli's Ars Conjectandi (1713) introduced the law of large numbers, establishing that the average of independent observations converges to the expected value, providing a cornerstone for modeling random processes.[10] Building on this, Abraham de Moivre's 1733 approximation of the binomial distribution to the normal curve offered a practical tool for handling large-scale variability, influencing subsequent probabilistic frameworks.[11] In the 19th century, advancements shifted toward systematic estimation techniques. Carl Friedrich Gauss's Theoria Motus Corporum Coelestium (1809) formalized the method of least squares for fitting linear models to observational data, minimizing errors under a normal distribution assumption and enabling precise astronomical predictions.[12] Concurrently, Pierre-Simon Laplace developed precursors to Bayesian inference through inverse probability in works like his 1774 memoir and later expansions in 1781 and 1786, allowing updates to probabilities based on evidence and laying groundwork for modern inferential modeling.[13] The 20th century marked the maturation of parametric and testing frameworks. 
Ronald Fisher's 1922 paper "On the Mathematical Foundations of Theoretical Statistics" advanced parametric models via maximum likelihood estimation, providing a unified approach to parameter inference and model selection.[14] In the 1930s, Jerzy Neyman and Egon Pearson's collaboration, notably their 1933 paper in Philosophical Transactions of the Royal Society, introduced hypothesis testing with the Neyman-Pearson lemma, emphasizing power and error control for decision-making under uncertainty.[15] Post-World War II, non-parametric methods emerged to relax distributional assumptions, with Frank Wilcoxon's 1945 rank-sum test exemplifying distribution-free alternatives for comparing groups.[16] Computational advances in the modern era expanded model complexity. From the 1980s, Bayesian networks, pioneered by Judea Pearl's 1985 framework for evidential reasoning, integrated graphical structures with probabilistic inference for handling dependencies in complex systems.[17] The 1990s and 2000s saw deeper integrations with machine learning, such as support vector machines (1995) and random forests (2001), blending statistical rigor with algorithmic scalability for high-dimensional data.[18] In the 21st century, the rise of big data catalyzed a key shift from descriptive statistics (summarizing past observations) to predictive modeling, leveraging vast datasets and computational power for forecasting, as seen in applications of machine learning to clinical and economic predictions.[19]
Definition and Framework
Formal definition
A statistical model is formally defined as a pair $ (S, \mathcal{P}) $, where $ S $ is the sample space consisting of all possible outcomes or observations from an experiment, and $ \mathcal{P} $ is a family of probability measures defined on $ S $.[20] This structure provides a mathematical framework for describing the uncertainty in data generation processes. More precisely, the sample space $ S $ is equipped with a $ \sigma $-algebra $ \mathcal{A} $ of measurable events, forming a measurable space $ (S, \mathcal{A}) $, on which the probability measures in $ \mathcal{P} $ are defined; these measures assign probabilities to the events in $ \mathcal{A} $.[20] In the parametric case, the family takes the form $ \mathcal{P} = \{P_\theta : \theta \in \Theta\} $, where $ \Theta $ is the parameter space indexing the distributions, and each $ P_\theta $ is a probability measure on $ (S, \mathcal{A}) $.[20] For the model to allow unique inference about parameters, it must satisfy the identifiability condition: if $ \theta_1 \neq \theta_2 $, then $ P_{\theta_1} \neq P_{\theta_2} $.[21] Unlike a fixed probability model, which specifies a single probability distribution, a statistical model defines a class of distributions from which the true generating mechanism is selected to fit observed data, enabling flexibility in statistical inference.[2] The likelihood function, central to parameter estimation within the model, is given in general form by the probability (or density) of the observed data $ x $ under a specific parameter value: $ L(\theta) = f(x; \theta) $.
Components and assumptions
A statistical model typically comprises several core components that formalize the probabilistic structure of the observed data. The primary elements include random variables representing the observations. In models that describe relationships between variables, such as regression models, these are often divided into response variables (often denoted as $ Y $, the outcome) and covariates (denoted as $ X $, the predictors). Parameters, denoted collectively as $ \theta $, quantify the model's characteristics and are treated as fixed unknowns in frequentist approaches or as random variables with prior distributions in Bayesian frameworks. In certain models with additive structure, error terms, often symbolized as $ \epsilon $, capture unexplained variation or noise, assuming they arise from an underlying probability distribution.[2][22] Central to the model's inferential framework are key assumptions that ensure the validity of statistical procedures. Observations are commonly assumed to be independent and identically distributed (i.i.d.), meaning each data point is drawn from the same probability distribution without influence from others. In many parametric models, such as linear regression, errors are further assumed to follow a normal distribution with mean zero and constant variance (homoscedasticity), alongside linearity or additivity in the relationship between predictors and the response. These assumptions underpin the model's ability to generalize from sample data to population inferences.[23][24] The likelihood function plays a pivotal role in linking these components to data for parameter estimation and hypothesis testing. For i.i.d. observations $ x_1, \dots, x_n $ from a distribution parameterized by $ \theta $, it is defined as $ L(\theta) = \prod_{i=1}^{n} f(x_i; \theta) $, where $ f $ is the probability (or density) function of a single observation.
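The product form of the likelihood can be checked numerically. A toy sketch for i.i.d. Bernoulli data (the data vector and grid below are made up for illustration) confirms that the sample proportion maximizes $ L(p) = \prod_i p^{x_i} (1 - p)^{1 - x_i} $.

```python
# Likelihood L(p) = prod_i p^x_i * (1 - p)^(1 - x_i) for i.i.d.
# Bernoulli observations, and a grid search over p to locate its maximum.
def likelihood(p, xs):
    out = 1.0
    for x in xs:
        out *= p ** x * (1 - p) ** (1 - x)
    return out

data = [1, 0, 1, 1, 0, 1, 1, 0]  # 5 successes in 8 trials (illustrative)
grid = [i / 1000 for i in range(1, 1000)]
p_best = max(grid, key=lambda p: likelihood(p, data))
# p_best coincides with the sample proportion 5/8 = 0.625.
```

The grid maximum lands exactly on the sample proportion, matching the analytic maximum-likelihood estimate for the Bernoulli model.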
Basic examples
One of the simplest statistical models is the Bernoulli model, which describes outcomes of a single trial with two possible results, such as a coin flip where success (X=1) occurs with probability p and failure (X=0) with probability 1-p.[27] The probability mass function is given by $ P(X = x) = p^x (1 - p)^{1 - x} $ for $ x \in \{0, 1\} $.
Applied examples
Statistical models find extensive application in real-world scenarios where data-driven insights inform decision-making across diverse fields. In these contexts, models are fitted to observed data to uncover patterns, predict outcomes, and quantify uncertainties, often assuming parametric forms with specified error distributions to enable inference. For instance, linear regression serves as a foundational tool for modeling continuous relationships, such as the growth in children's height as a function of age.[37] A classic application of linear regression appears in pediatric studies, where researchers model a child's height $ Y $ against their age using the equation $ Y = \beta_0 + \beta_1 \, \text{age} + \epsilon $, where $ \epsilon \sim \mathcal{N}(0, \sigma^2) $, with three parameters ($ \beta_0 $, $ \beta_1 $, and $ \sigma^2 $) estimated from longitudinal growth data. This model captures the linear trend in height increase during early childhood, allowing predictions of expected stature and identification of growth deviations for clinical intervention.[37] Such fitting extracts insights like average annual height gain, aiding in nutritional assessments and early detection of disorders. In medical diagnostics, logistic regression addresses binary outcomes, such as the presence or absence of a disease, by modeling the log-odds of the probability $ P $ as $ \log \frac{P}{1 - P} = \beta_0 + \beta_1 x $,
where $ x $ represents a predictor like exposure level or biomarker value. This approach is widely used for classification tasks, estimating disease risk from patient covariates and enabling probabilistic predictions that guide screening protocols.[38] For example, in cardiovascular research, it quantifies the association between risk factors and event occurrence, supporting targeted prevention strategies.[39] Time series models, particularly the autoregressive model of order one (AR(1)), are applied to financial data like stock prices to account for temporal dependencies and autocorrelation. The model is specified as $ X_t = \phi X_{t-1} + \epsilon_t $,
where $ X_t $ denotes the price at time $ t $, $ \phi $ measures persistence from the prior period, and $ \epsilon_t $ is white noise. Fitting this to historical stock returns reveals short-term momentum or mean-reversion patterns, informing trading algorithms and volatility forecasts.[40] Through parameter estimation, such models predict future price trajectories, helping investors assess market risks from observed fluctuations.[41] Beyond these, statistical models facilitate data fitting to derive actionable insights, such as trend projections in environmental monitoring or risk probabilities in insurance underwriting, by optimizing parameters to minimize discrepancies between predictions and data. In epidemiology, survival models like the Cox proportional hazards framework analyze time-to-event data, such as patient remission durations post-treatment, to evaluate intervention efficacy while censoring incomplete observations.[42] In economics, regression-based or time series models forecast product demand by relating sales to variables like income and pricing, optimizing inventory and pricing decisions in supply chains.[43] These applications underscore the models' role in translating empirical patterns into predictive and explanatory power across disciplines.
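The AR(1) fitting described above can be sketched as follows (a toy simulation with an assumed persistence value, not real price data): generate a series with known $ \phi $ and recover it by regressing each value on its lag.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate an AR(1) series X_t = phi * X_{t-1} + eps_t with white-noise
# errors; phi_true is an assumed value chosen for illustration.
phi_true, n = 0.6, 5000
eps = rng.normal(0.0, 1.0, size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + eps[t]

# Least-squares estimate of phi: regress x_t on x_{t-1} (no intercept).
phi_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
```

With 5000 observations the estimate sits close to the assumed persistence, illustrating how fitted AR(1) parameters summarize momentum or mean-reversion in a series.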
