Estimator
In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule (the estimator), the quantity of interest (the estimand) and its result (the estimate) are distinguished.[1] For example, the sample mean is a commonly used estimator of the population mean.
There are point and interval estimators. A point estimator yields a single-valued result, in contrast to an interval estimator, whose result is a range of plausible values. "Single value" does not necessarily mean "single number", but includes vector-valued or function-valued estimators.
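As a concrete illustration of the two kinds of estimator, here is a minimal Python/NumPy sketch (the simulated data, sample size, and 95% normal-approximation interval are illustrative assumptions, not taken from the text): it computes the sample mean as a point estimate of the population mean and a simple interval estimate around it.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=50)   # stand-in for observed data

# Point estimator: the sample mean yields a single-valued estimate.
point_estimate = sample.mean()

# Interval estimator: a normal-approximation 95% interval yields a range
# of plausible values for the population mean.
std_error = sample.std(ddof=1) / np.sqrt(sample.size)
lower, upper = point_estimate - 1.96 * std_error, point_estimate + 1.96 * std_error

print(f"point estimate:    {point_estimate:.3f}")
print(f"interval estimate: ({lower:.3f}, {upper:.3f})")
```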
Estimation theory is concerned with the properties of estimators; that is, with defining properties that can be used to compare different estimators (different rules for creating estimates) for the same quantity, based on the same data. Such properties can be used to determine the best rules to use under given circumstances. However, in robust statistics, statistical theory goes on to consider the balance between having good properties, if tightly defined assumptions hold, and having worse properties that hold under wider conditions.
Background
An "estimator" or "point estimate" is a statistic (that is, a function of the data) that is used to infer the value of an unknown parameter in a statistical model. A common way of phrasing it is "the estimator is the method selected to obtain an estimate of an unknown parameter". The parameter being estimated is sometimes called the estimand. It can be either finite-dimensional (in parametric and semi-parametric models) or infinite-dimensional (in semi-parametric and non-parametric models).[2] If the parameter is denoted $\theta$, then the estimator is traditionally written by adding a circumflex over the symbol: $\hat{\theta}$. Being a function of the data, the estimator is itself a random variable; a particular realization of this random variable is called the "estimate". Sometimes the words "estimator" and "estimate" are used interchangeably.
The definition places virtually no restrictions on which functions of the data can be called the "estimators". The attractiveness of different estimators can be judged by looking at their properties, such as unbiasedness, mean square error, consistency, asymptotic distribution, etc. The construction and comparison of estimators are the subjects of the estimation theory. In the context of decision theory, an estimator is a type of decision rule, and its performance may be evaluated through the use of loss functions.
When the word "estimator" is used without a qualifier, it usually refers to point estimation. The estimate in this case is a single point in the parameter space. There also exists another type of estimator: interval estimators, where the estimates are subsets of the parameter space.
The problem of density estimation arises in two applications. Firstly, in estimating the probability density functions of random variables and secondly in estimating the spectral density function of a time series. In these problems the estimates are functions that can be thought of as point estimates in an infinite dimensional space, and there are corresponding interval estimation problems.
Definition
Suppose a fixed parameter $\theta$ needs to be estimated. Then an "estimator" is a function that maps the sample space to a set of sample estimates. An estimator of $\theta$ is usually denoted by the symbol $\hat{\theta}$. It is often convenient to express the theory using the algebra of random variables: thus if $X$ is used to denote a random variable corresponding to the observed data, the estimator (itself treated as a random variable) is symbolised as a function of that random variable, $\hat{\theta}(X)$. The estimate for a particular observed data value $x$ (i.e. for $X = x$) is then $\hat{\theta}(x)$, which is a fixed value. Often an abbreviated notation is used in which $\hat{\theta}$ is interpreted directly as a random variable, but this can cause confusion.
Quantified properties
The following definitions and attributes are relevant.[3]
Error
For a given sample $x$, the "error" of the estimator $\hat{\theta}$ is defined as $e(x) = \hat{\theta}(x) - \theta$, where $\theta$ is the parameter being estimated. The error, $e$, depends not only on the estimator (the estimation formula or procedure), but also on the sample.
Mean squared error
The mean squared error of $\hat{\theta}$ is defined as the expected value (probability-weighted average, over all samples) of the squared errors; that is, $\operatorname{MSE}(\hat{\theta}) = \operatorname{E}\big[(\hat{\theta}(X) - \theta)^2\big]$. It is used to indicate how far, on average, the collection of estimates is from the single parameter being estimated. Consider the following analogy. Suppose the parameter is the bull's-eye of a target, the estimator is the process of shooting arrows at the target, and the individual arrows are estimates (samples). Then high MSE means the average distance of the arrows from the bull's-eye is high, and low MSE means the average distance from the bull's-eye is low. The arrows may or may not be clustered. For example, even if all arrows hit the same point yet grossly miss the target, the MSE is still relatively large. However, if the MSE is relatively low, then the arrows are likely more highly clustered (than highly dispersed) around the target.
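The archery analogy can be made numerical with a small Monte Carlo sketch (Python/NumPy; the normal population, its parameters, and the choice of the sample mean as the estimator are assumptions made here for illustration): the MSE of the sample mean is approximated by averaging squared errors over many repeated samples.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean, sigma, n, reps = 5.0, 2.0, 25, 100_000

# Each row is one sample ("volley of arrows"); each row's mean is one estimate.
estimates = rng.normal(true_mean, sigma, size=(reps, n)).mean(axis=1)

# MSE: average of squared errors over the repeated samples.
mse = np.mean((estimates - true_mean) ** 2)
print(f"simulated MSE: {mse:.4f}   (theory for the sample mean: sigma^2/n = {sigma**2 / n:.4f})")
```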
Sampling deviation
For a given sample $x$, the sampling deviation of the estimator $\hat{\theta}$ is defined as $d(x) = \hat{\theta}(x) - \operatorname{E}\big(\hat{\theta}(X)\big)$, where $\operatorname{E}\big(\hat{\theta}(X)\big)$ is the expected value of the estimator. The sampling deviation, $d$, depends not only on the estimator, but also on the sample.
Variance
The variance of $\hat{\theta}$ is the expected value of the squared sampling deviations; that is, $\operatorname{Var}(\hat{\theta}) = \operatorname{E}\big[(\hat{\theta} - \operatorname{E}(\hat{\theta}))^2\big]$. It is used to indicate how far, on average, the collection of estimates is from the expected value of the estimates. (Note the difference between MSE and variance.) If the parameter is the bull's-eye of a target and the arrows are estimates, then a relatively high variance means the arrows are dispersed, and a relatively low variance means the arrows are clustered. Even if the variance is low, the cluster of arrows may still be far off-target, and even if the variance is high, the diffuse collection of arrows may still be unbiased. Finally, even if all arrows grossly miss the target, if they nevertheless all hit the same point, the variance is zero.
Bias
The bias of $\hat{\theta}$ is defined as $B(\hat{\theta}) = \operatorname{E}(\hat{\theta}) - \theta$. It is the distance between the average of the collection of estimates and the single parameter being estimated. The bias of $\hat{\theta}$ is a function of the true value of $\theta$, so saying that the bias of $\hat{\theta}$ is $b$ means that for every $\theta$ the bias of $\hat{\theta}$ is $b$.
There are two kinds of estimators: biased estimators and unbiased estimators. Whether an estimator is biased or not can be identified by the relationship between $\operatorname{E}(\hat{\theta}) - \theta$ and 0:
- If $\operatorname{E}(\hat{\theta}) - \theta \neq 0$, $\hat{\theta}$ is biased.
- If $\operatorname{E}(\hat{\theta}) - \theta = 0$, $\hat{\theta}$ is unbiased.
The bias is also the expected value of the error, since $\operatorname{E}(\hat{\theta}) - \theta = \operatorname{E}(\hat{\theta} - \theta)$. If the parameter is the bull's-eye of a target and the arrows are estimates, then a relatively high absolute value for the bias means the average position of the arrows is off-target, and a relatively low absolute bias means the average position of the arrows is on target. They may be dispersed, or may be clustered. The relationship between bias and variance is analogous to the relationship between accuracy and precision.
The estimator $\hat{\theta}$ is an unbiased estimator of $\theta$ if and only if $B(\hat{\theta}) = 0$, i.e. $\operatorname{E}(\hat{\theta}) = \theta$. Bias is a property of the estimator, not of the estimate. Often, people refer to a "biased estimate" or an "unbiased estimate", but they really are talking about an "estimate from a biased estimator", or an "estimate from an unbiased estimator". Also, people often confuse the "error" of a single estimate with the "bias" of an estimator. That the error for one estimate is large does not mean the estimator is biased. In fact, even if all estimates have astronomical absolute values for their errors, if the expected value of the error is zero, the estimator is unbiased. Also, an estimator's being biased does not preclude the error of an estimate from being zero in a particular instance. The ideal situation is to have an unbiased estimator with low variance, and also try to limit the number of samples where the error is extreme (that is, to have few outliers). Yet unbiasedness is not essential. Often, if just a little bias is permitted, then an estimator can be found with lower mean squared error and/or fewer outlier sample estimates.
An alternative to the version of "unbiased" above, is "median-unbiased", where the median of the distribution of estimates agrees with the true value; thus, in the long run half the estimates will be too low and half too high. While this applies immediately only to scalar-valued estimators, it can be extended to any measure of central tendency of a distribution: see median-unbiased estimators.
In a practical problem, $\hat{\theta}$ can always have a functional relationship with $\theta$. For example, suppose a genetic theory states that a type of leaf (starchy green) occurs with probability $p_1 = \tfrac{1}{4}(\theta + 2)$, with $0 < \theta < 1$. Then, for $n$ leaves, the random variable $N_1$, the number of starchy green leaves, can be modeled with a $\operatorname{Bin}(n, p_1)$ distribution. This number can be used to express the following estimator for $\theta$: $\hat{\theta} = \tfrac{4}{n} N_1 - 2$. One can show that $\hat{\theta}$ is an unbiased estimator for $\theta$: $\operatorname{E}[\hat{\theta}] = \tfrac{4}{n}\operatorname{E}[N_1] - 2 = \tfrac{4}{n}\, n p_1 - 2 = 4 p_1 - 2 = (\theta + 2) - 2 = \theta.$
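Assuming the reconstruction of the example above (probability $p_1 = (\theta+2)/4$ and estimator $\hat{\theta} = 4N_1/n - 2$), a short Monte Carlo sketch in Python/NumPy can check the unbiasedness numerically; the chosen $\theta$, $n$, and replication count are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 0.5, 200, 100_000      # illustrative values; theta must lie in (0, 1)
p1 = (theta + 2) / 4                    # probability of a starchy green leaf

# N1 ~ Bin(n, p1): number of starchy green leaves in each simulated batch of n leaves.
N1 = rng.binomial(n, p1, size=reps)
theta_hat = 4 * N1 / n - 2              # the estimator for theta

print(f"average of the estimates: {theta_hat.mean():.4f}   true theta: {theta}")
```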
Unbiased
A desired property of an estimator is unbiasedness: the estimator has no systematic tendency to produce estimates larger or smaller than the true parameter. Among unbiased estimators, the one with the smaller variance is preferred, because its estimates tend to lie closer to the "true" value of the parameter. The unbiased estimator with the smallest variance is known as the minimum-variance unbiased estimator (MVUE).
To check whether an estimator is unbiased, it suffices to verify the equation $\operatorname{E}(T) - \theta = 0$: for an estimator $T$ of a parameter of interest $\theta$, showing $\operatorname{E}[T] = \theta$ establishes that the estimator is unbiased. Unbiasedness alone does not settle which estimator to prefer: if the sampling distributions of two estimators overlap and are both centered around $\theta$, an estimator other than the unique unbiased one may actually be preferred.
Expectation. When the quantity of interest is the expectation of the model distribution, an unbiased estimator should satisfy the two equations below:
$$\bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}, \qquad \operatorname{E}\big[\bar{X}_n\big] = \mu.$$
Variance. Similarly, when the quantity of interest is the variance of the model distribution, an unbiased estimator should satisfy the two equations below:
$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n} \big(X_i - \bar{X}_n\big)^2, \qquad \operatorname{E}\big[S_n^2\big] = \sigma^2.$$
Note that we are dividing by n − 1 because dividing by n would give an estimator with a negative bias, which would produce estimates that are too small for $\sigma^2$. It should also be mentioned that, even though $S_n^2$ is unbiased for $\sigma^2$, the reverse is not true: $S_n$ is not an unbiased estimator of the standard deviation $\sigma$.[4]
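A brief simulation sketch (Python/NumPy, with an arbitrary normal population) illustrates both remarks: dividing by $n$ gives a negatively biased variance estimator, dividing by $n-1$ gives an unbiased one, and taking the square root of the unbiased $S_n^2$ does not give an unbiased estimator of $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n, reps = 2.0, 10, 200_000

samples = rng.normal(0.0, sigma, size=(reps, n))
s2_unbiased = samples.var(axis=1, ddof=1)   # divide by n - 1
s2_biased = samples.var(axis=1, ddof=0)     # divide by n (negative bias)
s = np.sqrt(s2_unbiased)                    # sample standard deviation

print(f"E[S^2], dividing by n-1: {s2_unbiased.mean():.3f}  (sigma^2 = {sigma**2})")
print(f"E[S^2], dividing by n:   {s2_biased.mean():.3f}  (too small)")
print(f"E[S]:                    {s.mean():.3f}  (sigma = {sigma}; S is not unbiased for sigma)")
```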
Relationships among the quantities
- The mean squared error, variance, and bias are related: $\operatorname{MSE}(\hat{\theta}) = \operatorname{Var}(\hat{\theta}) + \big(B(\hat{\theta})\big)^2$, i.e. mean squared error = variance + square of bias. In particular, for an unbiased estimator, the variance equals the mean squared error (a numerical check of this identity is sketched after this list).
- The standard deviation of an estimator of $\theta$ (the square root of the variance), or an estimate of the standard deviation of an estimator of $\theta$, is called the standard error of $\hat{\theta}$.
- The bias-variance tradeoff arises in questions of model complexity, over-fitting and under-fitting. It is mainly used in the field of supervised learning and predictive modelling to diagnose the performance of algorithms.
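As promised above, here is a small numerical check of the identity (a Python/NumPy sketch; the deliberately biased divide-by-$n$ variance estimator and the normal population are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, n, reps = 1.0, 5, 500_000

samples = rng.normal(0.0, sigma, size=(reps, n))
estimates = samples.var(axis=1, ddof=0)     # a deliberately biased estimator of sigma^2

bias = estimates.mean() - sigma**2
variance = estimates.var()
mse = np.mean((estimates - sigma**2) ** 2)

print(f"MSE          ~ {mse:.5f}")
print(f"Var + Bias^2 ~ {variance + bias**2:.5f}")   # agrees with the MSE
```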
Example
Consider a random variable following a normal probability distribution $\mathcal{N}(\mu, \sigma^2)$, and a biased estimator of the mean of that distribution whose deviation from the true mean follows a degenerate distribution, i.e. is a fixed constant. Computing the estimator's variance with the Bienaymé formula, in which all but one of the terms are zero, together with its bias, we verify the relation between the mean square error, the variance and the bias for the estimation of the probability distribution mean.
Behavioral properties
Consistency
A consistent estimator is an estimator whose sequence of estimates converges in probability to the quantity being estimated as the index (usually the sample size) grows without bound. In other words, increasing the sample size increases the probability of the estimator being close to the population parameter.
Mathematically, an estimator is a consistent estimator for parameter θ if and only if, for the sequence of estimates $\{t_n;\; n \ge 0\}$ and for all ε > 0, no matter how small, we have
$$\lim_{n \to \infty} \Pr\big\{\, |t_n - \theta| < \varepsilon \,\big\} = 1.$$
The consistency defined above may be called weak consistency. The sequence is strongly consistent, if it converges almost surely to the true value.
An estimator that converges to a multiple of a parameter can be made into a consistent estimator by multiplying the estimator by a scale factor, namely the true value divided by the asymptotic value of the estimator. This occurs frequently in estimation of scale parameters by measures of statistical dispersion.
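A quick simulation sketch (Python/NumPy; the standard normal population, the tolerance ε = 0.1, and the sample sizes are arbitrary choices) shows the defining property in action for the sample mean: the probability of being within ε of the true value approaches 1 as n grows.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, eps, reps = 0.0, 1.0, 0.1, 1_000

for n in (10, 100, 1_000, 10_000):
    # Sample means from `reps` independent samples of size n.
    means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    prob_close = np.mean(np.abs(means - mu) < eps)
    print(f"n = {n:>6}:  P(|t_n - theta| < {eps}) ~ {prob_close:.3f}")
```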
Fisher consistency
An estimator can be considered Fisher consistent as long as the estimator is the same functional of the empirical distribution function as of the true distribution function, following the formula $\hat{\theta} = h(T_n), \; \theta = h(T_\theta)$, where $T_n$ and $T_\theta$ are the empirical distribution function and the theoretical distribution function, respectively. An easy way to see whether an estimator is Fisher consistent is to check the consistency of the mean and the variance: for the mean, the plug-in estimator is $\hat{\mu} = \bar{X}$, and for the variance one confirms that $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$, the variance of the empirical distribution.[5]
Asymptotic normality
An asymptotically normal estimator is a consistent estimator whose distribution around the true parameter θ approaches a normal distribution with standard deviation shrinking in proportion to $1/\sqrt{n}$ as the sample size n grows. Using $\xrightarrow{D}$ to denote convergence in distribution, $t_n$ is asymptotically normal if
$$\sqrt{n}\,(t_n - \theta) \xrightarrow{D} \mathcal{N}(0, V)$$
for some V.
In this formulation V/n can be called the asymptotic variance of the estimator. However, some authors also call V the asymptotic variance. Note that convergence will not necessarily have occurred for any finite "n", therefore this value is only an approximation to the true variance of the estimator, while in the limit the asymptotic variance (V/n) is simply zero. To be more specific, the distribution of the estimator $t_n$ converges weakly to a Dirac delta function centered at θ.
The central limit theorem implies asymptotic normality of the sample mean as an estimator of the true mean. More generally, maximum likelihood estimators are asymptotically normal under fairly weak regularity conditions — see the asymptotics section of the maximum likelihood article. However, not all estimators are asymptotically normal; the simplest examples are found when the true value of a parameter lies on the boundary of the allowable parameter region.
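A simulation sketch (Python/NumPy) of this behaviour for the sample mean, using an exponential population chosen for its skewness (the population, sample size, and replication count are illustrative assumptions): the standardized quantity √n(t_n − θ) should have approximately the N(0, V) distribution, with V equal to the population variance.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 200, 50_000
true_mean = 1.0   # mean of the standard exponential population (its variance is also 1)

sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (sample_means - true_mean)     # sqrt(n) * (t_n - theta)

print(f"mean of z:     {z.mean():+.3f}   (should be near 0)")
print(f"variance of z: {z.var():.3f}   (should be near V = 1)")
```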
Efficiency
The efficiency of an estimator captures how well it estimates the quantity of interest in a "minimum error" sense. In reality, there is no explicitly best estimator; there can only be a better estimator. Whether the efficiency of an estimator is better or not is based on the choice of a particular loss function, and it is reflected by two naturally desirable properties of estimators: being unbiased and having minimal mean squared error (MSE). These cannot in general both be satisfied simultaneously: a biased estimator may have a lower mean squared error than any unbiased estimator (see estimator bias). This equation relates the mean squared error with the estimator bias:[4]
$$\operatorname{E}\big[(\hat{\theta} - \theta)^2\big] = \big(\operatorname{E}[\hat{\theta}] - \theta\big)^2 + \operatorname{Var}(\hat{\theta}).$$
The first term represents the mean squared error; the second term represents the square of the estimator bias; and the third term represents the variance of the estimator. The quality of an estimator can be identified by comparing the variances, the squares of the estimator biases, or the MSEs. The variance of a good estimator (good efficiency) is smaller than the variance of a bad estimator (bad efficiency); likewise, the squared bias and the MSE of a good estimator are smaller than those of a bad estimator. Suppose there are two estimators, $\hat{\theta}_1$ (the good estimator) and $\hat{\theta}_2$ (the bad estimator). The above relationship can be expressed by the following formulas:
$$\operatorname{Var}(\hat{\theta}_1) < \operatorname{Var}(\hat{\theta}_2), \qquad \big|B(\hat{\theta}_1)\big| < \big|B(\hat{\theta}_2)\big|, \qquad \operatorname{MSE}(\hat{\theta}_1) < \operatorname{MSE}(\hat{\theta}_2).$$
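To make the comparison concrete, here is a simulation sketch (Python/NumPy) that treats the sample mean as $\hat{\theta}_1$ and the sample median as $\hat{\theta}_2$ for the mean of a normal population; the choice of these two estimators and of the population parameters is an illustrative assumption, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, reps = 0.0, 1.0, 100, 50_000

samples = rng.normal(mu, sigma, size=(reps, n))
estimators = {
    "sample mean (theta_hat_1)": samples.mean(axis=1),
    "sample median (theta_hat_2)": np.median(samples, axis=1),
}

for name, est in estimators.items():
    bias = est.mean() - mu
    var = est.var()
    mse = np.mean((est - mu) ** 2)
    print(f"{name:>27}:  bias^2 = {bias**2:.6f}  var = {var:.6f}  MSE = {mse:.6f}")
```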
Besides using formulas to identify the efficiency of an estimator, it can also be assessed graphically. For an efficient estimator, the frequency-versus-value plot of its estimates shows a curve with high frequency at the center and low frequency on the two sides. For example:

If an estimator is not efficient, the frequency-versus-value plot shows a relatively more gentle (flatter) curve.

To put it simply, the good estimator has a narrow curve, while the bad estimator has a wide one. Plotting these two curves on one graph with a shared y-axis makes the difference more obvious.

Among unbiased estimators, there often exists one with the lowest variance, called the minimum variance unbiased estimator (MVUE). In some cases an unbiased efficient estimator exists, which, in addition to having the lowest variance among unbiased estimators, satisfies the Cramér–Rao bound, which is an absolute lower bound on variance for statistics of a variable.
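As a standard worked case (a textbook example stated here for concreteness, not drawn from the text above), the sample mean of i.i.d. normal observations with known variance attains the Cramér–Rao bound, so it is an unbiased efficient estimator:

```latex
% Cramér–Rao bound for estimating \mu from X_1,\dots,X_n \sim N(\mu,\sigma^2), with \sigma^2 known.
% Fisher information of a single observation:
%   I_1(\mu) = -\operatorname{E}\!\left[\frac{\partial^2}{\partial\mu^2}\log f(X;\mu)\right]
%            = \frac{1}{\sigma^2}.
\[
  \operatorname{Var}(\hat\mu) \;\ge\; \frac{1}{n\,I_1(\mu)} \;=\; \frac{\sigma^2}{n},
  \qquad\text{and}\qquad
  \operatorname{Var}(\bar X_n) \;=\; \frac{\sigma^2}{n},
\]
% so the unbiased estimator \bar X_n attains the bound: it is efficient (and the MVUE) for \mu.
```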
Concerning such "best unbiased estimators", see also Cramér–Rao bound, Gauss–Markov theorem, Lehmann–Scheffé theorem, Rao–Blackwell theorem.
Robustness
See also
- Best linear unbiased estimator (BLUE)
- Invariant estimator
- Kalman filter
- Markov chain Monte Carlo (MCMC)
- Maximum a posteriori (MAP)
- Method of moments, generalized method of moments
- Minimum mean squared error (MMSE)
- Particle filter
- Pitman closeness criterion
- Sensitivity and specificity
- Shrinkage estimator
- Signal processing
- State observer
- Testimator
- Wiener filter
- Well-behaved statistic
References
- ^ Mosteller, F.; Tukey, J. W. (1987) [1968]. "Data Analysis, including Statistics". The Collected Works of John W. Tukey: Philosophy and Principles of Data Analysis 1965–1986. Vol. 4. CRC Press. pp. 601–720 [p. 633]. ISBN 0-534-05101-4 – via Google Books.
- ^ Kosorok (2008), Section 3.1, pp 35–39.
- ^ Jaynes (2007), p.172.
- ^ a b Dekking, Frederik Michel; Kraaikamp, Cornelis; Lopuhaä, Hendrik Paul; Meester, Ludolf Erwin (2005). A Modern Introduction to Probability and Statistics. Springer Texts in Statistics. ISBN 978-1-85233-896-1.
- ^ Lauritzen, Steffen. "Properties of Estimators" (PDF). University of Oxford. Retrieved 9 December 2023.
Further reading
- Bol'shev, Login Nikolaevich (2001) [1994], "Statistical estimator", Encyclopedia of Mathematics, EMS Press.
- Jaynes, E. T. (2007). Probability Theory: The logic of science (5 ed.). Cambridge University Press. ISBN 978-0-521-59271-0..
- Kosorok, Michael (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer Series in Statistics. Springer. doi:10.1007/978-0-387-74978-5. ISBN 978-0-387-74978-5.
- Lehmann, E. L.; Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. ISBN 0-387-98502-6.
- Shao, Jun (1998), Mathematical Statistics, Springer, ISBN 0-387-98674-X
External links
Estimator
Background and History
Origins in Probability and Statistics
The concept of estimation in statistics traces its roots to early probability theory, particularly through Jacob Bernoulli's formulation of the law of large numbers in 1713. In his posthumously published work Ars Conjectandi, Bernoulli demonstrated that the relative frequency of an event in repeated trials converges to its true probability as the number of trials increases, providing a foundational idea for inferring unknown parameters from sample data.[10] This weak law of large numbers served as a precursor to estimation by establishing the reliability of sample proportions as approximations of population probabilities, influencing later developments in consistent estimation techniques.[10]

Building on this probabilistic foundation, Pierre-Simon Laplace advanced the field in the late 18th century with his development of inverse probability, a method now recognized as the basis of Bayesian inference. In works from the 1770s, including his 1774 memoir, Laplace introduced the idea of calculating the probability of causes from observed effects, enabling point estimates of unknown parameters by combining prior beliefs with data.[11] This approach marked a significant step toward systematic statistical estimation, applying probability to reconcile inconsistent observations in fields like astronomy and geodesy.[11]

Carl Friedrich Gauss further solidified estimation principles in 1809 with his least squares method, presented in Theoria Motus Corporum Coelestium. Gauss proposed minimizing the sum of squared residuals to estimate parameters in linear models, assuming normally distributed errors, which equated to maximum likelihood estimation under those conditions.[12] This technique represented an early formal estimator for parameters, bridging observational data and probabilistic models, and laid groundwork for modern regression analysis.[12]

During the 19th century, statistical inference transitioned from these probabilistic origins toward more structured frequentist frameworks, emphasizing long-run frequencies over subjective priors. This shift promoted objective methods for parameter estimation based on repeated sampling, setting the stage for 20th-century developments in hypothesis testing and confidence intervals.[13]
Key Milestones and Contributors
The development of estimator theory in the 20th century was profoundly shaped by Ronald A. Fisher's seminal 1922 paper, "On the Mathematical Foundations of Theoretical Statistics," which introduced maximum likelihood estimation as a method for obtaining estimators that maximize the probability of observing the given data under the assumed model.[14] In this work, Fisher also introduced the concept of estimator efficiency and derived an early lower bound for the variance of unbiased estimators based on the information content of the sample (using the second derivative of the log-likelihood, now known as Fisher information), providing a benchmark for comparing estimator performance. The Cramér–Rao lower bound, which formalizes this variance bound more generally, was later derived independently by Harald Cramér and C. Radhakrishna Rao in the 1940s.[14]

Building on these foundations, Jerzy Neyman and Egon S. Pearson advanced the field in the 1930s through their collaborative efforts on hypothesis testing and estimation criteria. Their 1933 paper, "On the Problem of the Most Efficient Tests of Statistical Hypotheses," established the Neyman-Pearson lemma, which identifies uniformly most powerful tests and has implications for the selection of unbiased estimators that minimize error rates in decision-making contexts. This framework emphasized the role of unbiasedness in estimators, influencing subsequent evaluations of estimator reliability under finite samples.

Asymptotic theory emerged as a cornerstone of modern estimator analysis in the mid-1940s, with key contributions from Harald Cramér and C. Radhakrishna Rao. Cramér's 1946 book, Mathematical Methods of Statistics, systematically developed asymptotic properties such as consistency and normality of estimators, deriving the Cramér-Rao lower bound for the asymptotic variance of unbiased estimators under regularity conditions.[15] Independently, Rao's 1945 paper, "Information and the Accuracy Attainable in the Estimation of Statistical Parameters," introduced the Fisher information matrix and its role in bounding estimator precision, laying groundwork for multiparameter asymptotic efficiency.

Peter J. Huber's 1964 paper, "Robust Estimation of a Location Parameter," marked a pivotal shift toward robustness in estimator theory, proposing M-estimators that minimize the impact of outliers by using a convex loss function instead of squared error.[16] This approach demonstrated that robust estimators achieve near-efficiency under normal distributions while maintaining stability under contamination, influencing the design of outlier-resistant methods in applied statistics.

In the late 20th century, computational innovations transformed estimator evaluation, exemplified by Bradley Efron's 1979 introduction of the bootstrap method in "Bootstrap Methods: Another Look at the Jackknife."[17] This resampling technique allows empirical estimation of an estimator's sampling distribution without parametric assumptions, enabling bias and variance assessment for complex statistics and extending accessibility to non-asymptotic properties.
Core Concepts
Definition of an Estimator
In statistics, particularly within the framework of parametric inference, an estimator is formally defined as a function of a random sample $X_1, \ldots, X_n$ drawn from a population, mapping the sample values to an approximation $\hat{\theta}$ of an unknown parameter $\theta$ in the parameter space; that is, $\hat{\theta} = \delta(X_1, \ldots, X_n)$ for some function $\delta$.[18] This definition positions the estimator as a statistic specifically designed to infer the value of $\theta$, where the sample is assumed to follow a probability distribution parameterized by $\theta$.[19] A key distinction exists between the estimator itself, which is the general rule or procedure (often a formula), and the estimate, which is the concrete numerical value realized when the estimator is applied to a specific observed sample.[20] For instance, the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ serves as an estimator for the population mean $\mu$, but a particular value $\bar{x}$ computed from observed data constitutes the estimate.[21] Since the sample observations are random variables, the estimator inherits this randomness and is thus a random variable, subject to sampling variability that depends on the underlying distribution.[22] This random nature underscores the estimator's role in providing a probabilistic approximation to $\theta$ across repeated sampling from the parametric model.[18]
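A tiny Python/NumPy sketch of this distinction (the function name, the simulated data, and the normal population are illustrative assumptions): the estimator is the rule, i.e. the function of the sample, while the estimate is the number that the rule returns for one realized sample.

```python
import numpy as np

def mean_estimator(sample: np.ndarray) -> float:
    """The estimator: a rule mapping any sample (x_1, ..., x_n) to a number."""
    return float(np.mean(sample))

rng = np.random.default_rng(8)
observed = rng.normal(loc=3.0, scale=1.0, size=40)   # one realized sample

estimate = mean_estimator(observed)   # the estimate: the rule applied to this sample
print(f"estimate of the population mean: {estimate:.3f}")
```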
Estimand, Statistic, and Estimate
In statistics, the estimand is the target quantity of interest that inference seeks to quantify, typically an unknown parameter of the population distribution or a functional thereof, such as the mean or variance.[23] This concept anchors the analysis by specifying precisely what aspect of the underlying distribution is being investigated, independent of the data collected.[23] A statistic is any function of the observable random sample drawn from the population, serving as a summary measure derived directly from the data.[24] Estimators form a subset of statistics, specifically those selected to approximate the estimand; for instance, the sample mean $\bar{X}$ is an estimator targeting the population mean $\mu$ as the estimand when the sample arises from distributions sharing a common expected value.[25] In contrast, the estimate is the concrete numerical value produced by applying the estimator to a specific observed sample, such as computing $\bar{x}$ from data points $x_1, \ldots, x_n$.[23] These terms highlight the progression from theoretical target (estimand) to data-driven approximation (statistic and estimator) to realized output (estimate), ensuring precise communication in inference.[23] Point estimators like the sample mean yield a single value, whereas related constructs such as confidence intervals provide a range of plausible estimand values to account for sampling variability, though they differ by quantifying uncertainty rather than pinpointing a single approximation.[26]
Finite-Sample Properties
Bias and Unbiasedness
In statistics, the bias of an estimator $\hat{\theta}$ for a parameter $\theta$ is defined as the difference between its expected value and the true parameter value: $\operatorname{Bias}(\hat{\theta}) = \operatorname{E}[\hat{\theta}] - \theta$, where the expectation is taken over the sampling distribution of the data.[27] This measures the systematic tendency of the estimator to over- or underestimate the parameter on average across repeated samples from the population.[3] A positive bias indicates overestimation, while a negative bias indicates underestimation.[28] An estimator is unbiased if its expected value equals the true parameter for all possible values of $\theta$: $\operatorname{E}[\hat{\theta}] = \theta$.[29] Unbiasedness is preserved under linear combinations; if $\hat{\theta}_1$ and $\hat{\theta}_2$ are unbiased for $\theta_1$ and $\theta_2$, respectively, then $a\hat{\theta}_1 + b\hat{\theta}_2$ is unbiased for $a\theta_1 + b\theta_2$ for any constants $a$ and $b$, due to the linearity of expectation.[30] A classic example is the sample mean $\bar{X}$, which is an unbiased estimator of the population mean $\mu$ for independent and identically distributed samples from any distribution with finite mean, as $\operatorname{E}[\bar{X}] = \mu$.[31] Unbiasedness ensures that the estimator is correct in the long run, meaning that over many repeated samples, the average value of $\hat{\theta}$ will equal $\theta$, providing reliability in terms of systematic accuracy.[32] However, it does not guarantee precision in individual samples, as the spread of estimates around the true value (measured by variance) can still be large.[33]
Variance and Sampling Deviation
The variance of an estimator $\hat{\theta}$ quantifies the expected squared deviation of the estimator from its own expected value, providing a measure of its dispersion across repeated samples from the same population. Formally, it is defined as $\operatorname{Var}(\hat{\theta}) = \operatorname{E}\big[(\hat{\theta} - \operatorname{E}[\hat{\theta}])^2\big]$, where the expectation is taken over the sampling distribution of $\hat{\theta}$.[3] This definition parallels the variance of any random variable and highlights the inherent variability in $\hat{\theta}$ due to sampling randomness, independent of any systematic offset from the true parameter value.[34] The sampling deviation of an estimator is captured by its standard deviation, $\sqrt{\operatorname{Var}(\hat{\theta})}$, which represents the typical scale of fluctuations in $\hat{\theta}$ around $\operatorname{E}[\hat{\theta}]$. In estimation contexts, this quantity is commonly termed the standard error of the estimator, serving as a practical indicator of precision in inferential procedures such as confidence intervals.[35] For instance, larger standard errors imply greater uncertainty in the estimate, often arising from limited data or inherent population variability. Several factors influence the variance of an estimator. Primarily, it decreases with increasing sample size $n$, typically scaling as $1/n$ for many common estimators like the sample mean, thereby improving precision as more data are collected.[36] Additionally, the shape of the underlying distribution affects the variance; for example, distributions with heavier tails or higher kurtosis tend to yield estimators with larger variance due to greater spread in the data.[37] A concrete example is the sample variance estimator $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$ for a population variance $\sigma^2$, assuming independent and identically distributed observations $X_1, \ldots, X_n$. Under normality ($X_i \sim \mathcal{N}(\mu, \sigma^2)$), the variance of $S^2$ is given by $\operatorname{Var}(S^2) = \frac{2\sigma^4}{n-1}$, illustrating both the scaling with sample size and the dependence on the population variance itself.[38] This formula underscores how the estimator's variability diminishes as $n$ grows, while remaining sensitive to $\sigma$.
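A short simulation sketch (Python/NumPy; the particular σ, n, and replication count are arbitrary) checking the stated formula Var(S²) = 2σ⁴/(n−1) under normality:

```python
import numpy as np

rng = np.random.default_rng(9)
sigma, n, reps = 1.5, 8, 300_000

samples = rng.normal(0.0, sigma, size=(reps, n))
s2 = samples.var(axis=1, ddof=1)              # unbiased sample variance S^2

print(f"simulated Var(S^2):          {s2.var():.4f}")
print(f"theoretical 2*sigma^4/(n-1): {2 * sigma**4 / (n - 1):.4f}")
```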
Mean Squared Error
The mean squared error (MSE) of an estimator $\hat{\theta}$ for a parameter $\theta$ is defined as the expected value of the squared difference between the estimator and the true parameter value: $\operatorname{MSE}(\hat{\theta}) = \operatorname{E}\big[(\hat{\theta} - \theta)^2\big].$ This metric quantifies the average squared deviation of the estimator from $\theta$ over repeated samples, serving as a comprehensive measure of estimation accuracy in decision-theoretic frameworks.[3] The MSE decomposes into the variance of the estimator plus the square of its bias: $\operatorname{MSE}(\hat{\theta}) = \operatorname{Var}(\hat{\theta}) + \big[\operatorname{Bias}(\hat{\theta})\big]^2,$ where $\operatorname{Bias}(\hat{\theta}) = \operatorname{E}[\hat{\theta}] - \theta$ denotes the bias. This decomposition highlights the trade-off between bias, which reflects systematic over- or underestimation, and variance, which captures random fluctuations around the estimator's expected value; thus, MSE penalizes both sources of error equally in squared terms.[39] For unbiased estimators, where $\operatorname{Bias}(\hat{\theta}) = 0$, the MSE simplifies to the variance, emphasizing the role of variability in such cases.[28] To compare estimators operating on different scales or under varying parameter ranges, the relative MSE is employed, typically computed as the ratio of one estimator's MSE to a benchmark, such as that of the sample mean. This normalization facilitates scale-invariant assessments of relative performance.[40] In minimax estimation criteria, MSE functions as the primary loss measure, with the objective of selecting an estimator that minimizes the supremum of the MSE over the possible values of $\theta$, thereby ensuring robust performance against the worst-case scenario.[41]
Relationships Among Properties
The mean squared error (MSE) of an estimator serves as a comprehensive measure of its total error, which decomposes into the squared bias and the variance, illustrating how systematic deviation from the true parameter combines with random sampling variability to determine overall accuracy. This decomposition reveals that the total error is not merely additive but reflects the interplay between these components, where minimizing one can influence the other.[42] A key relationship among these properties is the inherent trade-off between bias and variance: efforts to eliminate bias entirely, such as through unbiased estimators, can inflate variance, leading to higher MSE in finite samples, while introducing controlled bias can reduce variance and yield a superior MSE. Shrinkage estimators exemplify this trade-off, as they deliberately bias estimates toward a central value to curb excessive variability; the James-Stein estimator, for instance, shrinks sample means toward a grand mean in multivariate normal settings, dominating the maximum likelihood estimator in MSE despite its bias. The Cramér-Rao lower bound establishes a theoretical interconnection by imposing a minimum on the variance of any unbiased estimator, equal to the reciprocal of the Fisher information, thereby linking unbiasedness directly to achievable precision limits and highlighting why biased alternatives may sometimes achieve lower MSE. This bound unifies the properties by showing that variance cannot be arbitrarily reduced without bias or additional assumptions. In estimator selection, these relationships favor MSE minimization over strict unbiasedness in many applications, as biased estimators often provide better finite-sample performance, though maximum likelihood estimators achieve asymptotic efficiency under regularity conditions.
Illustrative Example
To illustrate the finite-sample properties of estimators, consider estimating the population variance $\sigma^2$ from an independent random sample $X_1, \ldots, X_n$ drawn from a normal distribution with mean $\mu$ and variance $\sigma^2$. A commonly used estimator is the sample variance
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2,$$
where $\bar{X}$ is the sample mean. This estimator is unbiased, meaning its expected value equals the true parameter: $\operatorname{E}[S^2] = \sigma^2$, so the bias is zero.[43] The variance of $S^2$ is given by $\operatorname{Var}(S^2) = \frac{2\sigma^4}{n-1}$, which follows from the fact that $(n-1)S^2/\sigma^2$ follows a chi-squared distribution with $n-1$ degrees of freedom, whose variance is $2(n-1)$.[44] Consequently, the mean squared error is $\operatorname{MSE}(S^2) = \frac{2\sigma^4}{n-1}$, since the bias term vanishes. An alternative estimator is the biased version
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2.$$
Its expected value is $\frac{n-1}{n}\sigma^2$, yielding a bias of $-\frac{\sigma^2}{n}$.[43] The variance is $\operatorname{Var}(\hat{\sigma}^2) = \frac{2(n-1)\sigma^4}{n^2}$, obtained by scaling the unbiased estimator's variance appropriately. The mean squared error is then
$$\operatorname{MSE}(\hat{\sigma}^2) = \frac{2(n-1)\sigma^4}{n^2} + \frac{\sigma^4}{n^2} = \frac{(2n-1)\sigma^4}{n^2}.$$
This demonstrates the bias-variance trade-off: although $\hat{\sigma}^2$ introduces negative bias, its lower variance results in a smaller MSE compared to $S^2$ for all finite $n$. To visualize this trade-off, assume $\sigma^2 = 1$ and compute the MSE for small sample sizes. The following table shows the values:

| Sample size $n$ | MSE of $S^2$ (unbiased) | MSE of $\hat{\sigma}^2$ (biased) |
|---|---|---|
| 2 | 2.000 | 0.750 |
| 5 | 0.500 | 0.360 |
| 10 | 0.222 | 0.190 |
| 20 | 0.105 | 0.098 |
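The table can be reproduced directly from the two closed-form MSE expressions derived above; the short Python sketch below evaluates them for σ² = 1 (the values agree with the table after rounding).

```python
# MSE of the unbiased estimator S^2:        2 / (n - 1)        (for sigma^2 = 1)
# MSE of the biased divide-by-n estimator:  (2n - 1) / n^2
for n in (2, 5, 10, 20):
    mse_unbiased = 2 / (n - 1)
    mse_biased = (2 * n - 1) / n**2
    print(f"n = {n:>2}:  unbiased {mse_unbiased:.4f}   biased {mse_biased:.4f}")
```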
