Robust statistics


Robust statistics are statistics that maintain their properties even if the underlying distributional assumptions are incorrect. Robust statistical methods have been developed for many common problems, such as estimating location, scale, and regression parameters. One motivation is to produce statistical methods that are not unduly affected by outliers. Another motivation is to provide methods with good performance when there are small departures from a parametric distribution. For example, robust methods work well for mixtures of two normal distributions with different standard deviations; under this model, non-robust methods like a t-test work poorly.[1][2]

Introduction

Robust statistics seek to provide methods that emulate popular statistical methods, but are not unduly affected by outliers or other small departures from model assumptions. In statistics, classical estimation methods rely heavily on assumptions that are often not met in practice. In particular, it is often assumed that the data errors are normally distributed, at least approximately, or that the central limit theorem can be relied on to produce normally distributed estimates. Unfortunately, when there are outliers in the data, classical estimators often have very poor performance when judged using the breakdown point and the influence function, described below.

The practical effect of problems seen in the influence function can be studied empirically by examining the sampling distribution of proposed estimators under a mixture model, where one mixes in a small amount (1–5% is often sufficient) of contamination. For instance, one may use a mixture of 95% of a normal distribution and 5% of a normal distribution with the same mean but significantly higher standard deviation (representing outliers).
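
This contaminated-normal setup is easy to simulate. Below is a minimal, illustrative Python sketch (synthetic data; the 5% contamination level and the N(0, 10) contaminant are just the example described above):

```python
import random
import statistics

random.seed(0)
n = 10_000

# 95% N(0, 1) mixed with 5% N(0, 10): same mean, much heavier tails.
sample = [
    random.gauss(0.0, 10.0) if random.random() < 0.05 else random.gauss(0.0, 1.0)
    for _ in range(n)
]

# The contamination barely moves the median, but it inflates the sample
# standard deviation well beyond the "clean" value of 1.
med = statistics.median(sample)
sd = statistics.stdev(sample)
print(round(med, 3), round(sd, 3))
```

The population standard deviation of this mixture is sqrt(0.95·1 + 0.05·100) ≈ 2.44, so just 5% contamination more than doubles a non-robust scale estimate while leaving the median essentially untouched.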

Robust parametric statistics can proceed in two ways:

  • by designing estimators so that a pre-selected behaviour of the influence function is achieved
  • by replacing estimators that are optimal under the assumption of a normal distribution with estimators that are optimal for, or at least derived for, other distributions; for example, using the t-distribution with low degrees of freedom (high kurtosis) or with a mixture of two or more distributions.

Robust estimates have been studied for problems such as estimating location, scale, and regression parameters.

Definition

There are various definitions of a "robust statistic". Strictly speaking, a robust statistic is resistant to errors in the results produced by deviations from assumptions[3] (e.g., of normality). This means that if the assumptions are only approximately met, the robust estimator will still have a reasonable efficiency, and reasonably small bias, as well as being asymptotically unbiased, meaning having a bias tending towards 0 as the sample size tends towards infinity.

Usually, the most important case is distributional robustness, that is, robustness to breaking of the assumptions about the underlying distribution of the data.[3] Classical statistical procedures are typically sensitive to "long-tailedness" (e.g., when the distribution of the data has longer tails than the assumed normal distribution). This implies that they will be strongly affected by the presence of outliers in the data, and the estimates they produce may be heavily distorted if there are extreme outliers in the data, compared to what they would be if the outliers were not included in the data.

By contrast, more robust estimators that are not so sensitive to distributional distortions such as long-tailedness are also resistant to the presence of outliers. Thus, in the context of robust statistics, distributionally robust and outlier-resistant are effectively synonymous.[3] For one perspective on research in robust statistics up to 2000, see Portnoy & He (2000).

Some experts prefer the term resistant statistics for distributional robustness, and reserve 'robustness' for non-distributional robustness, e.g., robustness to violation of assumptions about the probability model or estimator, but this is a minority usage. Plain 'robustness' to mean 'distributional robustness' is common.

When considering how robust an estimator is to the presence of outliers, it is useful to test what happens when an extreme outlier is added to the dataset, and to test what happens when an extreme outlier replaces one of the existing data points, and then to consider the effect of multiple additions or replacements.

Examples

The mean is not a robust measure of central tendency. If the dataset is, e.g., the values {2,3,5,6,9}, then if we add another datapoint with value -1000 or +1000 to the data, the resulting mean will be very different from the mean of the original data. Similarly, if we replace one of the values with a datapoint of value -1000 or +1000 then the resulting mean will be very different from the mean of the original data.

The median is a robust measure of central tendency. Taking the same dataset {2,3,5,6,9}, if we add another datapoint with value -1000 or +1000 then the median will change slightly, but it will still be similar to the median of the original data. If we replace one of the values with a data point of value -1000 or +1000 then the resulting median will still be similar to the median of the original data.

Described in terms of breakdown points, the median has a breakdown point of 50%, meaning that half the points must be outliers before the median can be moved outside the range of the non-outliers, while the mean has a breakdown point of 0, as a single large observation can throw it off.
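These two behaviours can be checked directly; a small Python illustration using the same dataset:

```python
import statistics

data = [2, 3, 5, 6, 9]
contaminated = data + [1000]  # add one extreme datapoint

# The mean is dragged far away by the single outlier...
print(statistics.mean(data))          # 5
print(statistics.mean(contaminated))  # ≈ 170.8

# ...while the median barely moves.
print(statistics.median(data))          # 5
print(statistics.median(contaminated))  # 5.5
```

A single contaminating point moves the mean by over 160 units but the median by only 0.5, exactly the breakdown behaviour described above.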

The median absolute deviation and interquartile range are robust measures of statistical dispersion, while the standard deviation and range are not.
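As a quick illustration on hypothetical data with one gross outlier, the robust dispersion measures stay small while the non-robust ones explode:

```python
import statistics

data = [2, 3, 5, 6, 9, 1000]  # one gross outlier

# Median absolute deviation: the median of |x - median(x)|.
med = statistics.median(data)
mad = statistics.median(abs(x - med) for x in data)

# Interquartile range from the inclusive sample quartiles.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(mad, iqr)                # 3.0 4.75
print(max(data) - min(data))   # 998: the range is ruined by the outlier
print(statistics.stdev(data))  # hundreds, dominated by the outlier
```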

Trimmed estimators and Winsorised estimators are general methods to make statistics more robust. L-estimators are a general class of simple statistics, often robust, while M-estimators are a general class of robust statistics, and are now the preferred solution, though they can be quite involved to calculate.

Speed-of-light data

Gelman et al. in Bayesian Data Analysis (2004) consider a data set relating to speed-of-light measurements made by Simon Newcomb. The data sets for that book can be found via the Classic data sets page, and the book's website contains more information on the data.

Although the bulk of the data looks to be more or less normally distributed, there are two obvious outliers. These outliers have a large effect on the mean, dragging it towards them, and away from the center of the bulk of the data. Thus, if the mean is intended as a measure of the location of the center of the data, it is, in a sense, biased when outliers are present.

Also, the distribution of the mean is known to be asymptotically normal due to the central limit theorem. However, outliers can make the distribution of the mean non-normal, even for fairly large data sets. Besides this non-normality, the mean is also inefficient in the presence of outliers and less variable measures of location are available.

Estimation of location

The plot below shows a density plot of the speed-of-light data, together with a rug plot (panel (a)). Also shown is a normal Q–Q plot (panel (b)). The outliers are visible in these plots.

Panels (c) and (d) of the plot show the bootstrap distribution of the mean (c) and the 10% trimmed mean (d). The trimmed mean is a simple, robust estimator of location that deletes a certain percentage of observations (10% here) from each end of the data, then computes the mean in the usual way. The analysis was performed in R and 10,000 bootstrap samples were used for each of the raw and trimmed means.

The distribution of the mean is clearly much wider than that of the 10% trimmed mean (the plots are on the same scale). Also whereas the distribution of the trimmed mean appears to be close to normal, the distribution of the raw mean is quite skewed to the left. So, in this sample of 66 observations, only 2 outliers cause the central limit theorem to be inapplicable.
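The original analysis was done in R with 10,000 bootstrap samples; the sketch below reproduces the idea in Python on synthetic data (64 roughly normal points plus two low outliers, echoing the shape, not the actual values, of the speed-of-light data):

```python
import random
import statistics

random.seed(1)

def trimmed_mean(values, prop=0.10):
    """Drop prop of the observations from each end, then average the rest."""
    xs = sorted(values)
    k = int(len(xs) * prop)
    return statistics.mean(xs[k: len(xs) - k])

# Synthetic stand-in for the speed-of-light data: a normal bulk plus
# two gross low outliers.
data = [random.gauss(27, 5) for _ in range(64)] + [-44.0, -2.0]

boot_means, boot_trimmed = [], []
for _ in range(2000):
    resample = random.choices(data, k=len(data))
    boot_means.append(statistics.mean(resample))
    boot_trimmed.append(trimmed_mean(resample))

# The bootstrap distribution of the raw mean is much wider (and skewed)
# than that of the 10% trimmed mean.
print(round(statistics.stdev(boot_means), 2))
print(round(statistics.stdev(boot_trimmed), 2))
```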

Robust statistical methods, of which the trimmed mean is a simple example, seek to outperform classical statistical methods in the presence of outliers, or, more generally, when underlying parametric assumptions are not quite correct.

Whilst the trimmed mean performs well relative to the mean in this example, better robust estimates are available. In fact, the mean, median and trimmed mean are all special cases of M-estimators. Details appear in the sections below.

Estimation of scale

The outliers in the speed-of-light data have more than just an adverse effect on the mean; the usual estimate of scale is the standard deviation, and this quantity is even more badly affected by outliers because the squares of the deviations from the mean go into the calculation, so the outliers' effects are exacerbated.

The plots below show the bootstrap distributions of the standard deviation, the median absolute deviation (MAD) and the Rousseeuw–Croux (Qn) estimator of scale.[4] The plots are based on 10,000 bootstrap samples for each estimator, with some Gaussian noise added to the resampled data (smoothed bootstrap). Panel (a) shows the distribution of the standard deviation, (b) of the MAD and (c) of Qn.

The distribution of standard deviation is erratic and wide, a result of the outliers. The MAD is better behaved, and Qn is a little bit more efficient than MAD. This simple example demonstrates that when outliers are present, the standard deviation cannot be recommended as an estimate of scale.
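Both robust scale estimators are straightforward to compute. A sketch follows (the constants 1.4826 and 2.2219 make MAD and Qn consistent for the normal distribution; this naive Qn is O(n²) and omits the finite-sample corrections used in production implementations):

```python
import statistics
from itertools import combinations

def mad_scale(values):
    """Normal-consistent median absolute deviation: 1.4826 * MAD."""
    med = statistics.median(values)
    return 1.4826 * statistics.median(abs(x - med) for x in values)

def qn_scale(values):
    """Naive Rousseeuw-Croux Qn: a low order statistic of the pairwise
    distances |x_i - x_j|, scaled for consistency at the normal."""
    n = len(values)
    dists = sorted(abs(a - b) for a, b in combinations(values, 2))
    h = n // 2 + 1
    k = h * (h - 1) // 2
    return 2.2219 * dists[k - 1]

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 55.0]  # one wild value
print(round(statistics.stdev(data), 2))  # inflated by the outlier
print(round(mad_scale(data), 2))
print(round(qn_scale(data), 2))
```

Unlike the MAD, Qn needs no location estimate, which is one reason for its better efficiency at the normal model.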

Manual screening for outliers

Traditionally, statisticians would manually screen data for outliers, and remove them, usually checking the source of the data to see whether the outliers were erroneously recorded. Indeed, in the speed-of-light example above, it is easy to see and remove the two outliers prior to proceeding with any further analysis. However, in modern times, data sets often consist of large numbers of variables being measured on large numbers of experimental units. Therefore, manual screening for outliers is often impractical.

Outliers can often interact in such a way that they mask each other. As a simple example, consider a small univariate data set containing one modest and one large outlier. The estimated standard deviation will be grossly inflated by the large outlier. The result is that the modest outlier looks relatively normal. As soon as the large outlier is removed, the estimated standard deviation shrinks, and the modest outlier now looks unusual.
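Masking can be reproduced in a few lines. In this hypothetical sample, the z-score of the modest outlier looks unremarkable until the gross outlier is removed:

```python
import statistics

# Bulk near 10, one modest outlier (25) and one gross outlier (200).
data = [9, 10, 10, 11, 10, 9, 11, 10, 25, 200]

def zscore(x, values):
    return (x - statistics.mean(values)) / statistics.stdev(values)

# With the gross outlier present, the inflated standard deviation makes
# the modest outlier look nearly ordinary...
print(round(zscore(25, data), 2))

# ...but once the gross outlier is removed, 25 stands out clearly.
cleaned = [x for x in data if x != 200]
print(round(zscore(25, cleaned), 2))
```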

This problem of masking gets worse as the complexity of the data increases. For example, in regression problems, diagnostic plots are used to identify outliers. However, it is common that once a few outliers have been removed, others become visible. The problem is even worse in higher dimensions.

Robust methods provide automatic ways of detecting, downweighting (or removing), and flagging outliers, largely removing the need for manual screening. Care must be taken; initial data showing the ozone hole first appearing over Antarctica were rejected as outliers by non-human screening.[5]

Variety of applications

Although this article deals with general principles for univariate statistical methods, robust methods also exist for regression problems, generalized linear models, and parameter estimation of various distributions.

Measures of robustness

The basic tools used to describe and measure robustness are the breakdown point, the influence function and the sensitivity curve.

Breakdown point

Intuitively, the breakdown point of an estimator is the proportion of incorrect observations (e.g. arbitrarily large observations) an estimator can handle before giving an incorrect (e.g., arbitrarily large) result. Usually, the asymptotic (infinite sample) limit is quoted as the breakdown point, although the finite-sample breakdown point may be more useful.[6] For example, given n independent random variables X_1, …, X_n and the corresponding realizations x_1, …, x_n, we can use the sample mean X̄ = (X_1 + ⋯ + X_n)/n to estimate the mean. Such an estimator has a breakdown point of 0 (or finite-sample breakdown point of 1/n) because we can make x̄ arbitrarily large just by changing any one of x_1, …, x_n.

The higher the breakdown point of an estimator, the more robust it is. Intuitively, we can understand that a breakdown point cannot exceed 50% because if more than half of the observations are contaminated, it is not possible to distinguish between the underlying distribution and the contaminating distribution (Rousseeuw & Leroy 1987). Therefore, the maximum breakdown point is 0.5 and there are estimators which achieve such a breakdown point. For example, the median has a breakdown point of 0.5. The X% trimmed mean has a breakdown point of X%, for the chosen level of X. Huber (1981) and Maronna et al. (2019) contain more details. The level and the power breakdown points of tests are investigated in He, Simpson & Portnoy (1990).
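
The contrast between the two breakdown points can be demonstrated numerically; a small sketch that corrupts ever more points of a clean sample:

```python
import statistics

data = list(range(1, 21))  # 20 well-behaved observations

def corrupt(values, k, bad=10**9):
    """Replace the k largest observations with an arbitrarily bad value."""
    xs = sorted(values)
    return xs[: len(xs) - k] + [bad] * k

# The mean breaks down with a single corrupted point (breakdown point 0).
print(statistics.mean(corrupt(data, 1)))  # about 5e7

# The median stays inside the range of the good data as long as fewer
# than half of the points are corrupted (breakdown point 50%).
for k in range(10):
    assert statistics.median(corrupt(data, k)) <= 20
print(statistics.median(corrupt(data, 9)))  # 10.5
```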

Statistics with high breakdown points are sometimes called resistant statistics.[7]

Example: speed-of-light data

In the speed-of-light example, removing the two lowest observations causes the mean to change from 26.2 to 27.75, a change of 1.55. The estimate of scale produced by the Qn method is 6.3. We can divide this by the square root of the sample size to get a robust standard error, and we find this quantity to be 0.78. Thus, the change in the mean resulting from removing two outliers is approximately twice the robust standard error.

The 10% trimmed mean for the speed-of-light data is 27.43. Removing the two lowest observations and recomputing gives 27.67. The trimmed mean is less affected by the outliers and has a higher breakdown point.

If we replace the lowest observation, −44, by −1000, the mean becomes 11.73, whereas the 10% trimmed mean is still 27.43. In many areas of applied statistics, it is common for data to be log-transformed to make them near symmetrical. Very small values become large negative numbers when log-transformed, and zeroes become negatively infinite. Therefore, this example is of practical interest.

Empirical influence function

The empirical influence function is a measure of the dependence of the estimator on the value of any one of the points in the sample. It is a model-free measure in the sense that it simply relies on calculating the estimator again with a different sample. Tukey's biweight function, shown in the accompanying figure, is an example of what a "good" (in a sense defined later on) empirical influence function should look like.

In mathematical terms, an influence function is defined as a vector in the space of the estimator, which is in turn defined for a sample which is a subset of the population:

  1. (Ω, A, P) is a probability space,
  2. (X, Σ) is a measurable space (state space),
  3. Θ is a parameter space of dimension p ∈ ℕ*,
  4. (Γ, S) is a measurable space.

For example,

  1. (Ω, A, P) is any probability space,
  2. (X, Σ) = (ℝ, B), the real line with its Borel σ-algebra,
  3. Θ = ℝ × ℝ⁺, e.g. a location and a scale parameter.

The empirical influence function is defined as follows.

Let X_1, …, X_n : Ω → X be i.i.d. random variables and let x = (x_1, …, x_n) be a sample from these variables. Let T_n : X^n → Γ be an estimator, and let i ∈ {1, …, n}. The empirical influence function EIF_i at observation i is defined by:

EIF_i : x ∈ X ↦ n · ( T_n(x_1, …, x_{i−1}, x, x_{i+1}, …, x_n) − T_n(x_1, …, x_n) )

What this means is that we are replacing the i-th value in the sample by an arbitrary value x and looking at the output of the estimator. Alternatively, the EIF is defined as the effect, scaled by n+1 instead of n, on the estimator of adding the point x to the sample.[citation needed]
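
The definition translates directly into code. A small sketch computing the empirical influence of replacing one observation, for the mean and the median:

```python
import statistics

def eif(estimator, sample, i, x):
    """Empirical influence: n * (T(sample with x_i replaced by x) - T(sample))."""
    perturbed = list(sample)
    perturbed[i] = x
    return len(sample) * (estimator(perturbed) - estimator(sample))

sample = [2.0, 3.0, 5.0, 6.0, 9.0]

# For the mean, the empirical influence grows without bound in x...
print(eif(statistics.mean, sample, 0, 1000.0))   # ≈ 998
# ...while for the median it stays bounded however extreme x is.
print(eif(statistics.median, sample, 0, 1000.0)) # 5.0
```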

Influence function and sensitivity curve

Influence function when Tukey's biweight function (see section M-estimators below) is used as a loss function. Points with large deviation have no influence (y=0).

Instead of relying solely on the data, we could use the distribution of the random variables. The approach is quite different from that of the previous paragraph. What we are now trying to do is to see what happens to an estimator when we change the distribution of the data slightly: it assumes a distribution, and measures sensitivity to change in this distribution. By contrast, the empirical influence assumes a sample set, and measures sensitivity to change in the samples.[8]

Let A be a convex subset of the set of all finite signed measures on Σ. We want to estimate the parameter θ ∈ Θ of a distribution F in A. Let the functional T : A → Γ be the asymptotic value of some estimator sequence (T_n)_{n ∈ ℕ}. We will suppose that this functional is Fisher consistent, i.e. ∀θ ∈ Θ, T(F_θ) = θ. This means that at the model F, the estimator sequence asymptotically measures the correct quantity.

Let G be some distribution in A. What happens when the data doesn't follow the model F exactly but another, slightly different, model "going towards" G?

We're looking at:

dT_{G−F}(F) = lim_{t → 0⁺} ( T( tG + (1 − t)F ) − T(F) ) / t,

which is the one-sided Gateaux derivative of T at F, in the direction of G − F.

Let x ∈ X. Δ_x is the probability measure which gives mass 1 to {x}. We choose G = Δ_x. The influence function is then defined by:

IF(x; T; F) = lim_{t → 0⁺} ( T( tΔ_x + (1 − t)F ) − T(F) ) / t.

It describes the effect of an infinitesimal contamination at the point x on the estimate we are seeking, standardized by the mass t of the contamination (the asymptotic bias caused by contamination in the observations). For a robust estimator, we want a bounded influence function, that is, one which does not go to infinity as x becomes arbitrarily large.
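
For the mean, T(F) = ∫ y dF(y) is linear in F, so the limit can be computed exactly: T(tΔ_x + (1 − t)F) = t·x + (1 − t)·μ, and hence IF(x; mean; F) = x − μ, which is unbounded in x. A small numeric check of this (an illustrative sketch, taking μ = 5):

```python
# Mean of the contaminated distribution t*Delta_x + (1-t)*F, where F has mean mu.
def contaminated_mean(mu, x, t):
    return (1 - t) * mu + t * x

mu, x, t = 5.0, 1000.0, 1e-6

# Finite-t approximation of the influence function IF(x; mean; F) = x - mu.
if_approx = (contaminated_mean(mu, x, t) - mu) / t
print(if_approx)  # ≈ 995 = x - mu
```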

The empirical influence function uses the empirical distribution function F̂_n instead of the distribution function F, making use of the plug-in principle.

Desirable properties

Properties of an influence function that bestow it with desirable performance are:

  1. Finite rejection point ρ*,
  2. Small gross-error sensitivity γ*,
  3. Small local-shift sensitivity λ*.

Rejection point

The rejection point is defined as

ρ* = inf{ r > 0 : IF(x; T; F) = 0 for all |x| > r }.

Observations further than ρ* from the centre contribute nothing to the estimate. Estimators with a finite rejection point are called redescending: they reject gross outliers entirely.

Gross-error sensitivity

The gross-error sensitivity measures the worst-case effect of a small amount of contamination:

γ*(T; F) = sup_x | IF(x; T; F) |.

An estimator whose influence function is bounded has finite gross-error sensitivity and is said to be B-robust.

Local-shift sensitivity

λ*(T; F) = sup_{x ≠ y} || IF(y; T; F) − IF(x; T; F) || / || y − x ||

This value, which looks a lot like a Lipschitz constant, represents the effect of shifting an observation slightly from x to a neighbouring point y, i.e., adding an observation at y and removing one at x.

M-estimators

(The mathematical context of this paragraph is given in the section on empirical influence functions.)

Historically, several approaches to robust estimation were proposed, including R-estimators and L-estimators. However, M-estimators now appear to dominate the field as a result of their generality, their potential for high breakdown points and comparatively high efficiency. See Huber (1981).

M-estimators are not inherently robust. However, they can be designed to achieve favourable properties, including robustness. M-estimators are a generalization of maximum likelihood estimators (MLEs), which are determined by maximizing ∏_i f(x_i) or, equivalently, minimizing ∑_i −log f(x_i). In 1964, Huber proposed to generalize this to the minimization of ∑_i ρ(x_i), where ρ is some function. MLEs are therefore a special case of M-estimators (hence the name: "Maximum likelihood type" estimators).

Minimizing ∑_i ρ(x_i) can often be done by differentiating ρ and solving ∑_i ψ(x_i) = 0, where ψ(x) = dρ(x)/dx (if ρ has a derivative).

Several choices of ρ and ψ have been proposed. The two figures below show four ρ functions and their corresponding ψ functions.

For squared errors, ρ(x) increases at an accelerating rate, whilst for absolute errors, it increases at a constant rate. When Winsorizing is used, a mixture of these two effects is introduced: for small values of x, ρ increases at the squared rate, but once the chosen threshold is reached (1.5 in this example), the rate of increase becomes constant. This Winsorised estimator is also known as the Huber loss function.

Tukey's biweight (also known as bisquare) function behaves in a similar way to the squared error function at first, but for larger errors, the function tapers off.
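The following sketch implements Huber's and Tukey's ψ functions and solves the location M-estimation equation ∑ ψ((x_i − μ)/s) = 0 by iteratively reweighted averaging, with the scale s fixed at the normal-consistent MAD. The tuning constants 1.345 and 4.685 are the conventional choices for roughly 95% efficiency at the normal; the data are hypothetical:

```python
import statistics

def huber_psi(r, k=1.345):
    """Huber's psi: linear near zero, clipped at +/-k (bounded influence)."""
    return max(-k, min(k, r))

def tukey_psi(r, c=4.685):
    """Tukey biweight psi: redescends to 0, so gross outliers get zero weight."""
    return r * (1 - (r / c) ** 2) ** 2 if abs(r) < c else 0.0

def m_location(data, psi=huber_psi, iters=50):
    """Location M-estimate via iteratively reweighted averaging."""
    med = statistics.median(data)
    s = 1.4826 * statistics.median(abs(x - med) for x in data)  # robust scale
    mu = med
    for _ in range(iters):
        # Weights w(r) = psi(r)/r, with w(0) = 1 by continuity.
        w = [psi((x - mu) / s) / ((x - mu) / s) if x != mu else 1.0 for x in data]
        mu = sum(wi * x for wi, x in zip(w, data)) / sum(w)
    return mu

data = [2.0, 3.0, 5.0, 6.0, 9.0, 1000.0]
print(round(statistics.mean(data), 2))            # ≈ 170.83, wrecked
print(round(m_location(data), 2))                 # Huber: near the bulk
print(round(m_location(data, psi=tukey_psi), 2))  # biweight: near the bulk
```

With the redescending biweight the outlier receives weight exactly zero, while Huber's ψ merely bounds (but does not eliminate) its influence, so the two estimates differ slightly.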

Properties of M-estimators

M-estimators do not necessarily relate to a probability density function. Therefore, off-the-shelf approaches to inference that arise from likelihood theory cannot, in general, be used.

It can be shown that M-estimators are asymptotically normally distributed so that as long as their standard errors can be computed, an approximate approach to inference is available.

Since M-estimators are normal only asymptotically, for small sample sizes it might be appropriate to use an alternative approach to inference, such as the bootstrap. However, M-estimates are not necessarily unique (i.e., there might be more than one solution that satisfies the equations). Also, it is possible that any particular bootstrap sample can contain more outliers than the estimator's breakdown point. Therefore, some care is needed when designing bootstrap schemes.

Of course, as we saw with the speed-of-light example, the mean is only normally distributed asymptotically and when outliers are present the approximation can be very poor even for quite large samples. However, classical statistical tests, including those based on the mean, are typically bounded above by the nominal size of the test. The same is not true of M-estimators and the type I error rate can be substantially above the nominal level.

These considerations do not "invalidate" M-estimation in any way. They merely make clear that some care is needed in their use, as is true of any other method of estimation.

Influence function of an M-estimator

It can be shown that the influence function of an M-estimator T is proportional to ψ,[9] which means we can derive the properties of such an estimator (such as its rejection point, gross-error sensitivity or local-shift sensitivity) when we know its ψ function:

IF(x; T, F) = M⁻¹ ψ(x, T(F)),

with the p × p matrix M given by:

M = − ∫_X ( ∂ψ(x, θ) / ∂θ ) |_{θ = T(F)} dF(x).

Choice of ψ and ρ

In many practical situations, the choice of the ψ function is not critical to gaining a good robust estimate, and many choices will give similar results that offer great improvements, in terms of efficiency and bias, over classical estimates in the presence of outliers.[10]

Theoretically, redescending ψ functions are to be preferred, and Tukey's biweight (also known as bisquare) function is a popular choice. Maronna et al.[11] recommend the biweight function with efficiency at the normal set to 85%.

Robust parametric approaches

M-estimators do not necessarily relate to a density function and so are not fully parametric. Fully parametric approaches to robust modeling and inference, both Bayesian and likelihood approaches, usually deal with heavy-tailed distributions such as Student's t-distribution.

For the t-distribution with ν degrees of freedom, it can be shown that

ψ(x) = x / (x² + ν).

For ν = 1, the t-distribution is equivalent to the Cauchy distribution. The degrees of freedom ν is sometimes known as the kurtosis parameter. It is the parameter that controls how heavy the tails are. In principle, ν can be estimated from the data in the same way as any other parameter. In practice, it is common for there to be multiple local maxima when ν is allowed to vary. As such, it is common to fix ν at a value around 4 or 6. The figure below displays the ψ-function for 4 different values of ν.
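
Up to a constant factor, this ψ is redescending: its influence peaks near x = √ν and then decays toward zero, so gross outliers are effectively ignored. A quick check:

```python
# psi implied by the t-density with nu degrees of freedom (up to a constant).
def t_psi(x, nu=4):
    return x / (x * x + nu)

print(t_psi(1.0))    # 0.2  : moderate residuals carry real influence
print(t_psi(2.0))    # 0.25 : the maximum, at x = sqrt(nu)
print(t_psi(100.0))  # ~0.01: a gross outlier has almost none
```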

Example: speed-of-light data

For the speed-of-light data, the kurtosis parameter ν can either be allowed to vary, maximizing the likelihood jointly over the location, scale and kurtosis parameters, or fixed at a small value in advance, maximizing the likelihood over the remaining parameters only. Either way, the resulting location estimate is far less sensitive to the two outliers than the raw mean.

Related concepts

A pivotal quantity is a function of data, whose underlying population distribution is a member of a parametric family, that is not dependent on the values of the parameters. An ancillary statistic is such a function that is also a statistic, meaning that it is computed in terms of the data alone. Such functions are robust to parameters in the sense that they are independent of the values of the parameters, but not robust to the model in the sense that they assume an underlying model (parametric family), and in fact, such functions are often very sensitive to violations of the model assumptions. Thus test statistics, frequently constructed in terms of these so as not to be sensitive to assumptions about parameters, are still very sensitive to model assumptions.

Replacing outliers and missing values

Replacing missing data is called imputation. If there are relatively few missing points, there are some models which can be used to estimate values to complete the series, such as replacing missing values with the mean or median of the data. Simple linear regression can also be used to estimate missing values.[12] In addition, outliers can sometimes be accommodated in the data through the use of trimmed means, other scale estimators apart from standard deviation (e.g., MAD) and Winsorization.[13] In calculations of a trimmed mean, a fixed percentage of data is dropped from each end of an ordered data, thus eliminating the outliers. The mean is then calculated using the remaining data. Winsorizing involves accommodating an outlier by replacing it with the next highest or next smallest value as appropriate.[14]
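
Both accommodations are a few lines of code. An illustrative sketch on hypothetical data, with 10% handled at each end:

```python
import statistics

def trimmed_mean(values, prop=0.1):
    """Drop the lowest and highest prop fraction, then average the rest."""
    xs = sorted(values)
    k = int(len(xs) * prop)
    return statistics.mean(xs[k: len(xs) - k])

def winsorize(values, prop=0.1):
    """Pull each extreme value in to the nearest retained order statistic."""
    xs = sorted(values)
    k = int(len(xs) * prop)
    lo, hi = xs[k], xs[-k - 1]
    return [min(max(x, lo), hi) for x in values]

data = [1, 5, 6, 6, 7, 7, 8, 8, 9, 500]
print(trimmed_mean(data))                # 7: outliers dropped
print(statistics.mean(winsorize(data)))  # 7: outliers pulled in
print(statistics.mean(data))             # 55.7: the outlier dominates
```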

However, using these types of models to predict missing values or outliers in a long time series is difficult and often unreliable, particularly if the number of values to be in-filled is relatively high in comparison with total record length. The accuracy of the estimate depends on how good and representative the model is and how long the period of missing values extends.[15] When dynamic evolution is assumed in a series, the missing data point problem becomes an exercise in multivariate analysis (rather than the univariate approach of most traditional methods of estimating missing values and outliers). In such cases, a multivariate model will be more representative than a univariate one for predicting missing values. The Kohonen self organising map (KSOM) offers a simple and robust multivariate model for data analysis, thus providing good possibilities to estimate missing values, taking into account their relationship or correlation with other pertinent variables in the data record.[14]

Standard Kalman filters are not robust to outliers. To this end, Ting, Theodorou & Schaal (2007) showed that a modification of Masreliez's theorem can deal with outliers.

One common approach to handle outliers in data analysis is to perform outlier detection first, followed by an efficient estimation method (e.g., least squares). While this approach is often useful, one must keep in mind two challenges. First, an outlier detection method that relies on a non-robust initial fit can suffer from the effect of masking, that is, a group of outliers can mask each other and escape detection.[16] Second, if a high-breakdown initial fit is used for outlier detection, the follow-up analysis might inherit some of the inefficiencies of the initial estimator.[17]

Use in machine learning

Although influence functions have a long history in statistics, they were not widely used in machine learning due to several challenges. One of the primary obstacles is that traditional influence functions rely on expensive second-order derivative computations and assume model differentiability and convexity. These assumptions are limiting, especially in modern machine learning, where models are often non-differentiable, non-convex, and operate in high-dimensional spaces.

Koh & Liang (2017) addressed these challenges by introducing methods to efficiently approximate influence functions using second-order optimization techniques, such as those developed by Pearlmutter (1994), Martens (2010), and Agarwal, Bullins & Hazan (2017). Their approach remains effective even when the assumptions of differentiability and convexity degrade, enabling influence functions to be used in the context of non-convex deep learning models. They demonstrated that influence functions are a powerful and versatile tool that can be applied to a variety of tasks in machine learning, including:

  • Understanding Model Behavior: Influence functions help identify which training points are most “responsible” for a given prediction, offering insights into how models generalize from training data.
  • Debugging Models: Influence functions can assist in identifying domain mismatches—when the training data distribution does not match the test data distribution—which can cause models with high training accuracy to perform poorly on test data, as shown by Ben-David et al. (2010). By revealing which training examples contribute most to errors, developers can address these mismatches.
  • Dataset Error Detection: Noisy or corrupted labels are common in real-world data, especially when crowdsourced or adversarially attacked. Influence functions allow human experts to prioritize reviewing only the most impactful examples in the training set, facilitating efficient error detection and correction.
  • Adversarial Attacks: Models that rely heavily on a small number of influential training points are vulnerable to adversarial perturbations. These perturbed inputs can significantly alter predictions and pose security risks in machine learning systems where attackers have access to the training data (See adversarial machine learning).

Koh and Liang’s contributions have opened the door for influence functions to be used in various applications across machine learning, from interpretability to security, marking a significant advance in their applicability.

from Grokipedia
Robust statistics is a branch of statistics focused on developing methods that remain stable and reliable when data deviate from idealized assumptions, such as the presence of outliers, heavier-tailed distributions, or violations of independence.[1] These deviations can severely undermine classical procedures like the sample mean or least squares regression, which assume perfect normality and no contamination, leading to biased or inefficient estimates.[2] The primary goal is to enhance the gross-error sensitivity of estimators, ensuring they perform well under realistic data conditions where contamination rates of 1-10% are common.[1] The field traces its roots to early 19th-century critiques of least squares by astronomers like Bessel (1818) and Newcomb (1886), who noted empirical distributions often have heavier tails than the normal assumed by Gauss (1821).[1] Modern robust statistics emerged in the mid-20th century, spurred by John Tukey's 1960 call for methods resilient to "bad" data points that could dominate analyses.[1] Peter J. 
Huber laid foundational work with his 1964 paper on robust estimation of location parameters, introducing M-estimators that minimize a robust loss function to balance efficiency under normality with resistance to outliers.[3] Frank Hampel further advanced the theory in 1968 by defining key robustness measures, including the influence function (quantifying an estimator's sensitivity to infinitesimal contamination at a point) and the breakdown point (the maximum fraction of corrupted data an estimator can tolerate before failing arbitrarily).[1] The field expanded rapidly through the 1970s and 1980s, with Huber's 1981 book providing a comprehensive theoretical framework.[4] Central to robust statistics are quantitative measures of performance: the breakdown point, where the median achieves 50% (resisting up to half the data as outliers) compared to 0% for the mean, and the influence function, which bounds the impact of any single observation to prevent undue leverage.[2] Notable methods include location estimators like the median and trimmed mean (discarding extremes for resistance at the cost of some efficiency), Huber's M-estimator (using a piecewise linear loss for 95% asymptotic efficiency under normality), and Tukey's biweight (a redescending estimator for stronger outlier rejection).[1] For multivariate and regression settings, high-breakdown techniques such as the Least Median of Squares (LMS) and Minimum Covariance Determinant (MCD), developed by Peter Rousseeuw in the 1980s, achieve near-50% breakdown while detecting leverage points.[5] More advanced approaches like S-estimators and MM-estimators combine high breakdown with high efficiency, making them suitable for contaminated datasets.[2] Robust methods are essential in fields like astronomy, engineering, and chemometrics, where data anomalies arise from measurement errors or model mismatches, improving inference reliability over classical alternatives.[6] Ongoing developments address computational 
challenges in high dimensions and integrate robustness with machine learning for scalable applications.[6]

Fundamentals

Introduction

Robust statistics encompasses statistical methods designed to yield reliable inferences even when the underlying data deviate from idealized model assumptions, such as the presence of outliers or incorrect distributional specifications.[7] These techniques aim to provide stable and interpretable results by mitigating the impact of anomalous observations, which can otherwise lead to misleading conclusions in classical procedures like least squares estimation.[8] The field emerged in the 1960s as a response to the vulnerabilities of traditional statistics, pioneered by John W. Tukey and collaborators who emphasized the need for procedures resistant to real-world data imperfections.[9] Tukey's advocacy for "resistant" or "robust" methods highlighted how classical estimators could fail dramatically under slight perturbations, prompting a shift toward exploratory and protective data analysis strategies.[10] A core objective of robust statistics is to balance high efficiency—optimal performance under the assumed model—with resilience to contamination, ensuring estimators remain effective across a range of plausible data-generating processes.[8] In the 1970s, Frank R. Hampel advanced this framework by developing tools to quantify and control the sensitivity of estimators to individual data points, laying foundational principles for modern robustness theory.[11]

Definition and Prerequisites

Robust statistics refers to a branch of statistical theory and methodology that develops estimators and tests capable of maintaining desirable properties, such as efficiency and reliability, even when the underlying data distribution deviates moderately from the idealized model assumed in classical statistics. These deviations may include outliers, heavy-tailed distributions, or other forms of contamination. The performance of robust procedures is typically assessed through asymptotic properties in large samples, particularly by minimizing the worst-case asymptotic variance over a neighborhood of plausible distributions surrounding the target model.[12] A key framework for evaluating robustness involves the contamination model, which formalizes potential deviations from the ideal distribution. In this model, the observed distribution $P_\epsilon$ is given by
$$P_\epsilon = (1 - \epsilon) P + \epsilon Q,$$
where $P$ is the target (ideal) distribution, $Q$ is an arbitrary contaminating distribution, and $\epsilon \in [0, 1)$ represents the proportion of contamination, often taken to be small (e.g., $\epsilon = 0.05$ or $0.1$) to reflect realistic scenarios. This model allows for the design of estimators that remain stable under limited gross errors while optimizing performance under $P$.[12] In contrast to classical statistical methods, which assume that the data are drawn exactly from the specified model $P$ and derive optimal procedures under that exact fit, robust methods explicitly account for possible contamination by $Q$, ensuring bounded degradation in performance. Classical approaches can exhibit extreme sensitivity to even a few outliers, leading to unreliable inferences, whereas robust alternatives prioritize stability across the $\epsilon$-neighborhood.[12] To engage with robust statistics, foundational knowledge from probability and classical estimation theory is required. Basic concepts include the expectation of a random variable $X$, defined as $E[X] = \int_{-\infty}^{\infty} x \, dF(x)$ for its cumulative distribution function $F$, and the variance $\operatorname{Var}(X) = E[(X - E[X])^2]$, which measures dispersion around the mean. Classical estimators, such as the sample mean $\bar{X} = n^{-1} \sum_{i=1}^n X_i$ for location and the sample standard deviation $s = \sqrt{n^{-1} \sum_{i=1}^n (X_i - \bar{X})^2}$ for scale, assume independent and identically distributed (i.i.d.) observations from a known parametric family. Under these assumptions, the sample mean is unbiased ($E[\bar{X}] = \mu$) and consistent, meaning $\bar{X} \xrightarrow{p} \mu$ as the sample size $n \to \infty$. Further prerequisites involve asymptotic properties central to both classical and robust inference. 
Consistency ensures that an estimator $\hat{\theta}_n$ converges in probability to the true parameter $\theta$ as $n \to \infty$. Asymptotic normality, a consequence of the central limit theorem for i.i.d. data with finite variance, states that $\sqrt{n}(\bar{X} - \mu) \xrightarrow{d} N(0, \sigma^2)$, providing a normal approximation for inference in large samples. These properties form the basis for evaluating robust estimators, which extend them to contaminated settings while preserving similar guarantees under the target model $P$.
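As a concrete illustration of the $\epsilon$-contamination model, the sketch below (with illustrative parameters) draws from $P_\epsilon$ with $P = N(0,1)$ and $Q = N(0,10)$, the mixture scenario described in the introduction:

```python
import random
import statistics

random.seed(0)

def sample_contaminated(n, eps=0.05, sigma_q=10.0):
    """Draw n points from P_eps = (1 - eps) * N(0, 1) + eps * N(0, sigma_q^2):
    each observation comes from the "good" distribution P with probability
    1 - eps and from the contaminating distribution Q otherwise."""
    return [random.gauss(0, sigma_q) if random.random() < eps else random.gauss(0, 1)
            for _ in range(n)]

data = sample_contaminated(10_000)

# The median stays close to the target location 0, while the spread of the
# sample is visibly inflated by the heavy-tailed 5% contamination component.
med = statistics.median(data)
spread = statistics.pstdev(data)
```

Even this mild contamination roughly doubles the observed standard deviation relative to the uncontaminated $N(0,1)$ value of 1, which is why variance-based classical procedures degrade under the mixture.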

Measures of Robustness

Breakdown Point

The breakdown point quantifies the global robustness of a statistical estimator by measuring the smallest fraction of data that must be corrupted or replaced by arbitrary outliers to cause the estimator to produce an unbounded or meaningless result, such as its value diverging to infinity or its bias exceeding any predefined bound. This concept, originally developed for location estimators, serves as a key indicator of an estimator's reliability against gross errors in the data.[13] Two variants of the breakdown point are commonly distinguished: the finite-sample breakdown point, which evaluates robustness for a fixed sample size $n$ by considering the minimal number of contaminated observations needed to spoil the estimator, and the asymptotic breakdown point, which assesses the limiting behavior as $n \to \infty$.[14] The asymptotic version is particularly useful for theoretical comparisons, as it abstracts away from sample-specific details. The asymptotic breakdown point $\eta^*$ of an estimator $T$ at a distribution $F$ is formally defined as
$$\eta^* = \inf\left\{ \epsilon : \sup_Q \left| T\bigl( (1-\epsilon) F + \epsilon Q \bigr) - T(F) \right| = \infty \right\},$$
where $\epsilon$ is the contamination proportion and the supremum is taken over all possible contaminating distributions $Q$. This represents the infimum of $\epsilon$ values for which there exists some $Q$ that drives the estimator arbitrarily far from its value under the uncontaminated $F$. Illustrative examples highlight the range of breakdown points across estimators. The sample mean has a breakdown point of 0%, as even one outlier can shift its value indefinitely. In contrast, the sample median achieves the maximum possible breakdown point of 50%, requiring contamination of at least half the data to cause breakdown.[15] For the $\alpha$-trimmed mean, which discards the outer $\alpha n / 2$ observations from each end (total proportion $\alpha$ trimmed) before computing the mean, the breakdown point is $\alpha / 2$ (or $(\alpha/2) \times 100\%$).[16] Despite its value as a global measure, the breakdown point has limitations: it focuses on worst-case catastrophic failure but overlooks finer-grained local robustness to small contaminations, and estimators designed for high breakdown points often exhibit reduced statistical efficiency when the data follow an ideal uncontaminated model like the normal distribution.[1]
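The finite-sample contrast between the mean and the median can be checked numerically; the values below are illustrative, not drawn from any cited dataset:

```python
import statistics

# Finite-sample breakdown sketch: replace k of n = 20 points with a huge
# outlier and watch when each estimator "breaks".
clean = list(range(1, 21))          # 1..20, mean and median both 10.5

def corrupt(data, k, value=1e9):
    """Replace the last k observations with an arbitrarily large value."""
    return data[:len(data) - k] + [value] * k

# One corrupted point already drags the mean arbitrarily far (0% breakdown)...
mean_1 = statistics.mean(corrupt(clean, 1))

# ...while the median stays put with 9 of 20 points corrupted and only
# breaks once half the sample is replaced (50% breakdown).
median_9 = statistics.median(corrupt(clean, 9))
median_10 = statistics.median(corrupt(clean, 10))
```

With nine corrupted points the median is still 10.5; with ten, it too is carried off by the contamination, matching the 50% bound.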

Influence Function and Sensitivity

The influence function serves as a key diagnostic tool in robust statistics, quantifying the local impact of an individual observation on an estimator viewed as a functional $T$ of the underlying distribution $F$. It measures the infinitesimal change in $T$ induced by contaminating $F$ with a small proportion of mass at a specific point $x$. Formally, the influence function is defined as
$$\mathrm{IF}(x; T, F) = \lim_{\epsilon \to 0} \frac{T\bigl((1-\epsilon)F + \epsilon \delta_x\bigr) - T(F)}{\epsilon},$$
where $\delta_x$ denotes the Dirac delta distribution concentrated at $x$.[11] This expression captures the rate of change of $T$ at $F$ in the direction of $\delta_x$.[11] The influence function's boundedness is central to assessing robustness: if $|\mathrm{IF}(x; T, F)|$ remains finite for all $x$, the estimator resists unbounded shifts from any single outlier, unlike classical estimators such as the mean, whose influence function is unbounded and linear in $x$.[11] The gross-error sensitivity $\gamma^*(T, F) = \sup_x |\mathrm{IF}(x; T, F)|$ provides a scalar summary of this maximum possible influence, with lower values indicating greater local robustness to point contamination.[11] For instance, the sample median has a gross-error sensitivity of $1/(2f(\mu))$, where $f(\mu)$ is the density at the median $\mu$, which is finite and typically small for symmetric distributions.[17] The sensitivity curve, obtained by plotting $\mathrm{IF}(x; T, F)$ against $x$, offers a visual representation of how sensitivity varies across the support of $F$.[17] This curve highlights regions of high or low influence, aiding in the design of estimators that downweight extreme values; for example, a curve that rises initially but plateaus or declines for large $|x|$ signals effective outlier rejection.[17] Desirable properties of the influence function include boundedness and a redescending shape for large deviations from the center, which promotes rejection of substantial residuals and enhances overall stability.[17] Such redescending behavior aligns with the $\psi$-function in estimation procedures that diminish influence for outliers, contributing to qualitative robustness as defined by Hampel, where small perturbations in the data distribution lead to correspondingly small changes in the estimator.[18] Qualitative robustness requires the estimator to be continuous with respect to weak convergence of distributions and the Prokhorov metric on probability measures.[18] The influence function arises as the Gâteaux derivative of $T$ at $F$ in the direction $\delta_x - F$, providing a first-order Taylor expansion for the contaminated functional: $T((1-\epsilon)F + \epsilon \delta_x) \approx T(F) + \epsilon \cdot \mathrm{IF}(x; T, F)$.[11] This derivation, without requiring full differentiability assumptions, extends to von Mises functionals where $T(F) = \int \phi(x) \, dF(x)$ for some integrable $\phi$, yielding $\mathrm{IF}(x; T, F) = \phi(x) - \int \phi(y) \, dF(y)$.[17] While complementary to global measures like the breakdown point—which evaluates resistance to substantial contamination fractions—the influence function specifically probes sensitivity to infinitesimal point masses.[11]
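A finite-sample analogue of the influence function is Tukey's sensitivity curve, $SC_n(x) = (n+1)\bigl(T_{n+1}(x_1,\dots,x_n,x) - T_n(x_1,\dots,x_n)\bigr)$, which the following sketch evaluates for the mean and the median on an illustrative symmetric sample:

```python
import statistics

# Fixed symmetric sample around 0; adding one observation at x and scaling
# the resulting shift approximates the influence function empirically.
sample = [float(v) for v in range(-10, 11)]
n = len(sample)

def sensitivity(estimator, x):
    """Tukey's sensitivity curve: (n + 1) * (T(sample + [x]) - T(sample))."""
    return (n + 1) * (estimator(sample + [x]) - estimator(sample))

# For the mean the shift grows linearly in x (unbounded influence);
# for the median it stays bounded no matter how extreme x becomes.
mean_shift_far = sensitivity(statistics.mean, 1000.0)
median_shift_far = sensitivity(statistics.median, 1000.0)
```

Plotting `sensitivity` over a grid of `x` values reproduces the qualitative shapes discussed above: a straight line for the mean and a flat plateau for the median.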

Core Methods

M-Estimators

M-estimators form a broad class of robust estimators introduced by Peter J. Huber in 1964 as a generalization of maximum likelihood estimators, designed to mitigate the influence of outliers in estimating parameters like location.[12] For a sample $X_1, \dots, X_n$ from a distribution with location $\theta$ and scale $\sigma$, the M-estimator $\hat{\theta}$ solves the estimating equation
$$\sum_{i=1}^n \psi\left( \frac{X_i - \theta}{\sigma} \right) = 0,$$
where $\psi$ is an odd, non-decreasing function that bounds the influence of large residuals, unlike the unbounded $\psi(u) = u$ of least squares.[12] Equivalently, $\hat{\theta}$ minimizes the objective function
$$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^n \rho\left( \frac{X_i - \hat{\theta}}{\hat{\sigma}} \right),$$
with $\psi = \rho'$ and $\hat{\sigma}$ a robust scale estimate, such as the median absolute deviation.[12] Under regularity conditions on $\psi$ and $\rho$, M-estimators possess desirable asymptotic properties: they are consistent, asymptotically normal at rate $\sqrt{n}$, with asymptotic variance $V = E[\psi^2(U)] / [E[\psi'(U)]]^2$, where $U$ follows the standardized error distribution.[12] By tuning $\psi$, these estimators can achieve near-maximum efficiency at the normal distribution—for instance, Huber's choice yields 95% efficiency while maintaining robustness—balancing bias and variance in contaminated models.[12] Common $\psi$ functions include Huber's, defined as $\psi(u) = u$ for $|u| \leq c$ and $\psi(u) = c \cdot \operatorname{sign}(u)$ for $|u| > c$, which weights small residuals linearly but caps the influence of outliers at the threshold $c$ (often 1.345 for 95% efficiency).[12] Another popular choice is Tukey's biweight, with $\rho(u) = \frac{1}{6}\left(1 - (1 - u^2)^3\right)$ for $|u| \leq 1$ (after rescaling by $c$) and constant thereafter, leading to a redescending $\psi(u) = u(1 - u^2)^2$ on the same interval that fully rejects extreme outliers by driving their weight to zero.[19] This redescending behavior enhances robustness in heavy-tailed data but requires careful initialization to avoid local minima.[19] Computing M-estimators typically involves iterative algorithms, with iteratively reweighted least squares (IRLS) being the standard approach: start with an initial estimate $\hat{\theta}^{(0)}$ (e.g., the median), compute weights $w_i = \psi(r_i)/r_i$ where $r_i = (X_i - \hat{\theta}^{(k)})/\hat{\sigma}$, and update $\hat{\theta}^{(k+1)}$ via weighted least squares until convergence.[20] IRLS converges reliably for convex $\rho$ (monotone $\psi$) and suitable starting values, often within 10-20 iterations for moderate samples, though it may require safeguards such as a robust start for redescending $\psi$.[20]
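The IRLS recipe above can be sketched for the location case in a few lines of plain Python; the MAD calibration constant 0.6745 and the sample values are illustrative choices, not part of any cited implementation:

```python
import statistics

def huber_psi(u, c=1.345):
    """Huber's psi: identity for small residuals, capped at +/- c beyond it."""
    return max(-c, min(c, u))

def huber_location(data, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimator of location via IRLS (minimal sketch).

    Scale is held fixed at the standardized median absolute deviation, and
    the sample median supplies the starting value, as in the text."""
    med = statistics.median(data)
    mad = statistics.median(abs(x - med) for x in data) / 0.6745  # MAD -> sigma-hat
    theta = med
    for _ in range(max_iter):
        # IRLS weights w_i = psi(r_i) / r_i (weight 1 at a zero residual).
        w = []
        for x in data:
            r = (x - theta) / mad
            w.append(huber_psi(r, c) / r if r != 0 else 1.0)
        new_theta = sum(wi * xi for wi, xi in zip(w, data)) / sum(w)
        if abs(new_theta - theta) < tol:
            break
        theta = new_theta
    return theta

data = [2.1, 1.9, 2.0, 2.2, 1.8, 2.05, 50.0]   # one gross outlier
est = huber_location(data)
```

The outlier at 50 receives a tiny weight, so the estimate stays near 2, whereas the sample mean of the same data is pulled to roughly 8.9.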

Robust Parametric Approaches

Robust parametric approaches extend the M-estimation framework to structured parametric models, such as linear regression, where the goal is to estimate parameters under assumed functional forms while mitigating outlier effects. These methods generalize maximum likelihood estimation by replacing the log-likelihood with a robust loss function $\rho$ that bounds the contribution of deviant observations, often applied to residuals standardized by a robust scale estimate $\sigma$. This adaptation ensures that the estimator remains consistent and asymptotically normal under mild contamination of the error distribution.[21] In linear regression, the robust estimator $\hat{\beta}$ solves an optimization problem that minimizes the sum of $\rho$ over residuals:
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^n \rho\left( \frac{y_i - x_i^T \beta}{\hat{\sigma}} \right),$$
where $\hat{\sigma}$ is typically a high-breakdown scale estimator like the median absolute deviation to avoid inflation by outliers. A classic implementation uses Huber's loss, where $\rho(r) = r^2/2$ for $|r| \leq k$ and $\rho(r) = k(|r| - k/2)$ otherwise, with the tuning constant $k = 1.345$ chosen to achieve 95% asymptotic efficiency relative to least squares under normality; this approach downweights but does not eliminate large residuals, providing a gross-error sensitivity of approximately $1.3\sigma$.[21][22] For enhanced robustness against leverage points and higher contamination levels, MM-estimators refine an initial high-breakdown S-estimator using a subsequent M-estimation step with a redescending $\psi$-function, such as Tukey's biweight, to achieve a breakdown point of up to 50% while retaining high efficiency (e.g., 85% or more at the normal). Developed by Yohai, this two-stage procedure ensures the final estimate is both affine-equivariant and computationally feasible for moderate dimensions, with the initial S-estimator minimizing a robust scale measure of the residuals, typically computed over subsamples of the data.[23] Key challenges in robust parametric estimation include initialization for non-convex losses, where poor starting values can lead to inconsistent solutions, and the selection of tuning constants for the $\psi$- and $\rho$-functions, which balance breakdown point against efficiency—for example, smaller constants enhance robustness but reduce efficiency under clean data. In MM-estimators, the S-estimator serves as a robust initializer, but high-dimensional problems may require subsampling to manage computation, and constants like $c = 4.685$ for the biweight $\rho$ ensure 95% efficiency in the final stage while the initial S-stage secures the high breakdown point.[23][22]
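A minimal sketch of this scheme for simple one-predictor regression, using a hand-rolled IRLS loop rather than any particular library; the data, the OLS start, and the small scale floor are illustrative assumptions:

```python
import statistics

def wls_line(x, y, w):
    """Closed-form weighted least squares for y = a + b*x."""
    sw = sum(w)
    xb = sum(wi * xi for wi, xi in zip(w, x)) / sw
    yb = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b = sum(wi * (xi - xb) * (yi - yb) for wi, xi, yi in zip(w, x, y)) / \
        sum(wi * (xi - xb) ** 2 for wi, xi in zip(w, x))
    return yb - b * xb, b

def huber_line(x, y, k=1.345, n_iter=50):
    """Huber-type robust simple regression via IRLS (minimal sketch)."""
    a, b = wls_line(x, y, [1.0] * len(x))            # ordinary LS start
    for _ in range(n_iter):
        res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        med = statistics.median(res)
        # Robust scale: standardized MAD of the residuals (floored to
        # avoid degeneracy when the fit becomes exact).
        s = max(statistics.median(abs(r - med) for r in res) / 0.6745, 1e-6)
        # Huber weights: 1 for small residuals, k/|r/s| beyond the threshold.
        w = [1.0 if abs(r / s) <= k else k / abs(r / s) for r in res]
        a, b = wls_line(x, y, w)
    return a, b

x = list(range(10))
y = [2.0 * xi + 1.0 for xi in x]
y[9] = 100.0                      # one contaminated response
a_rob, b_rob = huber_line(x, y)
```

The contaminated point pulls an ordinary least squares slope above 6, while the reweighted fit recovers the underlying line $y = 2x + 1$ almost exactly.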

Applications and Extensions

Handling Outliers and Missing Values

Robust strategies for handling outliers begin with their detection, which can be achieved using robust residuals derived from preliminary robust fits, such as those obtained via M-estimators, to identify observations that deviate significantly from the bulk of the data. Diagnostic plots, including quantile-quantile (Q-Q) plots scaled with robust measures of dispersion like the median absolute deviation, further aid in visualizing departures from assumed distributions and flagging potential outliers.[24] Once detected, outliers can be addressed through replacement techniques that mitigate their influence without complete removal. Winsorizing caps extreme values by replacing those below the α-quantile with the α-quantile value and those above the (1-α)-quantile with the (1-α)-quantile value, thereby preserving sample size while reducing outlier impact; a common choice is α=0.05, corresponding to the 5th and 95th percentiles. The winsorized mean is then computed as the arithmetic average of this adjusted dataset:
$$\bar{x}_w = \frac{1}{n} \sum_{i=1}^n x_i^{(w)},$$
where $x_i^{(w)} = q_{\alpha}$ if $x_i < q_{\alpha}$, $x_i^{(w)} = x_i$ if $q_{\alpha} \leq x_i \leq q_{1-\alpha}$, and $x_i^{(w)} = q_{1-\alpha}$ if $x_i > q_{1-\alpha}$, with $q_{\alpha}$ denoting the $\alpha$-quantile of the data. For imputation of missing or outlier-replaced values, robust central tendency measures like the median of complete cases provide a non-parametric alternative that resists contamination. Missing values pose a related challenge, often compounded by outliers in the observed data, and robust multiple imputation addresses this by generating multiple plausible imputations based on M-estimators fitted to complete cases, ensuring that the imputation model itself is insensitive to aberrant observations. This approach involves iteratively drawing from the posterior predictive distribution using robust location and scale estimates, then analyzing each imputed dataset separately before combining results via Rubin's rules. In scenarios with high outlier proportions, such as model fitting in noisy environments, the Random Sample Consensus (RANSAC) algorithm offers a non-parametric method for robust estimation by repeatedly sampling minimal subsets to hypothesize models, evaluating consensus via inlier counts, and selecting the model with the largest inlier set while discarding outliers.[25] This iterative process, typically run for a fixed number of trials determined by the desired confidence and outlier fraction, effectively isolates reliable data for parameter estimation in regression or geometric models.[25]
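The winsorized mean can be sketched directly; the simple order-statistic quantile convention used here ($k = \lfloor \alpha n \rfloor$ points capped in each tail) is an illustrative choice among several common definitions:

```python
def winsorize(data, alpha=0.05):
    """Two-sided winsorizing: cap values below the alpha-quantile and above
    the (1 - alpha)-quantile at those quantiles."""
    xs = sorted(data)
    n = len(xs)
    k = int(alpha * n)                 # points capped in each tail
    lo, hi = xs[k], xs[n - k - 1]      # empirical alpha / (1 - alpha) cut points
    return [min(max(x, lo), hi) for x in data]

def winsorized_mean(data, alpha=0.05):
    w = winsorize(data, alpha)
    return sum(w) / len(w)

data = list(range(1, 20)) + [1000.0]   # 20 points, one gross outlier
wm = winsorized_mean(data, alpha=0.05)
```

With $\alpha = 0.05$ and 20 observations, exactly one point is capped in each tail: the outlier 1000 is pulled down to 19, and the winsorized mean is 10.5 versus a raw mean of 59.5.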

Use in Machine Learning

In machine learning, robust statistics addresses key challenges posed by real-world data imperfections, such as noisy labels, adversarial attacks, and imbalanced distributions in large-scale datasets. Noisy labels, often arising from human annotation errors or automated labeling processes, can degrade model performance by misleading optimization toward incorrect patterns, particularly in deep neural networks where memorization of noise amplifies errors. Adversarial attacks, which involve subtle perturbations to inputs that fool models while remaining imperceptible to humans, exploit vulnerabilities in gradient-based learning, leading to unreliable predictions in safety-critical applications like autonomous driving. Imbalanced data, prevalent in big data scenarios such as fraud detection or medical diagnostics, skews model training toward majority classes, resulting in poor generalization for minority instances and heightened sensitivity to outliers.[26][27] To mitigate these issues, robust techniques incorporate specialized loss functions and model modifications that downweight anomalous influences. Robust loss functions, such as the Huber loss, blend quadratic behavior for small errors with linear penalties for large deviations, making them less sensitive to outliers than mean squared error while remaining differentiable for gradient descent in neural networks. These functions derive from M-estimators in robust statistics, adapting classical estimation principles to machine learning optimization. Outlier-robust support vector machines (SVMs) extend standard SVMs by integrating convex outlier ablation or homotopy algorithms during training, which identify and downweight non-support vectors affected by noise, thereby preserving margin-based separation in high-dimensional spaces.[28] The Huber loss is defined as:
$$L_\delta(y, f(x)) = \begin{cases} \frac{1}{2} (y - f(x))^2 & \text{if } |y - f(x)| < \delta \\ \delta\, |y - f(x)| - \frac{\delta^2}{2} & \text{otherwise} \end{cases}$$
Practical examples illustrate these techniques' impact. Robust principal component analysis (PCA) decomposes data into low-rank and sparse components, enabling effective anomaly detection by isolating outliers as the sparse residual, which has proven valuable in network traffic monitoring where it achieves low false-positive rates on packet data. In deep learning, certified robustness provides provable guarantees against adversarial perturbations, with post-2010s advances like randomized smoothing and interval bound propagation scaling to ImageNet-sized models and diverse architectures, ensuring epsilon-bounded resilience in classifiers.[29][30] Recent developments through 2025 have integrated robust statistics with federated learning to enhance privacy-preserving robustness across distributed systems. In federated settings, where models aggregate updates from decentralized clients without sharing raw data, robust aggregation rules—such as trimmed means or median-based methods—counter poisoning attacks and non-IID data heterogeneity, improving convergence and accuracy in meta-learning tasks like personalized recommendation. Systematic reviews highlight that these integrations, often combining contrastive learning with outlier detection, mitigate label noise and Byzantine faults while complying with differential privacy constraints, leading to improved performance on heterogeneous benchmarks.[31][32]
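The piecewise definition above translates directly into code; the sample predictions are illustrative:

```python
def huber_loss(y, y_hat, delta=1.0):
    """Huber loss: quadratic for small errors, linear beyond delta."""
    err = abs(y - y_hat)
    if err < delta:
        return 0.5 * err ** 2
    return delta * err - 0.5 * delta ** 2

small = huber_loss(2.0, 1.5)    # quadratic regime: 0.5 * 0.5**2 = 0.125
large = huber_loss(10.0, 0.0)   # linear regime: 1 * 10 - 0.5 = 9.5
```

Note the two branches agree at the threshold (both give $\delta^2/2$ at $|y - f(x)| = \delta$), so the loss and its gradient stay continuous, which is what keeps it usable with gradient descent.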

Comparisons to Classical Statistics

Classical estimators, such as the maximum likelihood estimator (MLE), achieve optimal asymptotic efficiency when the underlying model assumptions are satisfied. For instance, under a normal distribution, the sample mean serves as the MLE for the location parameter and attains an asymptotic relative efficiency (ARE) of 1.0 relative to itself. In contrast, robust estimators trade off some efficiency for resistance to deviations from the model. The sample median, a canonical robust location estimator, exhibits an ARE of approximately 0.64 compared to the mean under normality, though its efficiency improves significantly under heavier-tailed distributions, reaching about 0.96 for a Student's t-distribution with 5 degrees of freedom.[2] Classical approaches rely on strict adherence to parametric assumptions, such as exact normality of errors for the validity of the mean or least squares regression. Violations, even minor ones like outliers, can lead to substantial bias or inefficiency. Robust methods, however, are formulated to tolerate such deviations through models like ε-contamination, where the observed data distribution is (1-ε) times the ideal model plus ε times an arbitrary contaminating distribution, allowing reliable inference for small ε (typically up to 0.1–0.2). A key trade-off in robust statistics is between robustness and computational cost; highly robust procedures often require iterative optimization, increasing runtime compared to closed-form classical solutions like the mean. For example, the Huber M-estimator solves a weighted least squares problem via numerical methods, potentially demanding more resources than ordinary least squares. 
Additionally, achieving high robustness, such as a breakdown point near 0.5, typically reduces efficiency under ideal conditions, whereas estimators like those based on the t-distribution strike a moderate balance, offering reasonable robustness against moderate contamination while maintaining higher efficiency than the median for symmetric heavy-tailed errors.[2] Classical methods are appropriate for controlled environments with clean data satisfying model assumptions, ensuring maximal precision. Robust methods should be preferred in real-world scenarios where data messiness is common; literature on data mining and statistics notes that most real-world datasets contain outliers, which can severely bias classical estimates if unaddressed. The following table summarizes key comparisons for location estimators under the normal distribution, highlighting the trade-offs in breakdown point (maximum fraction of outliers tolerable before arbitrary breakdown) and ARE relative to the mean:
| Estimator | Breakdown Point | ARE (normal) |
|---|---|---|
| Mean | 0 | 1.0 |
| Median | 0.5 | 0.64 |
| Huber M-estimator | 0.5 | ≈0.95 |
These values illustrate how the mean excels in efficiency but fails with any outliers, the median provides maximal robustness at the cost of efficiency, and the Huber estimator offers a practical compromise with high efficiency and moderate robustness.[2][33]
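The ARE entry for the median can be checked by simulation; the sketch below is a quick Monte Carlo estimate with illustrative sample sizes, so the result carries sampling noise around the theoretical $2/\pi \approx 0.64$:

```python
import random
import statistics

# Monte Carlo check: under normality the sampling variance of the median is
# about pi/2 times that of the mean, so their variance ratio ~= 0.64.
random.seed(1)
n, reps = 101, 2000
means, medians = [], []
for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]
    means.append(statistics.mean(sample))
    medians.append(statistics.median(sample))

are_median = statistics.pvariance(means) / statistics.pvariance(medians)
```

Repeating the experiment with heavier-tailed draws (e.g., a Student's t with 5 degrees of freedom) pushes the ratio toward 1, mirroring the efficiency reversal described above.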

Historical Development

The origins of robust statistics can be traced to the late 19th century, when astronomers and statisticians began addressing the sensitivity of classical estimators to outliers in observational data. Simon Newcomb's 1886 discussion provided one of the first modern frameworks for robust estimation, introducing the use of mixture models of normal densities to account for contaminated data distributions.[34] Earlier ideas, such as the median as a resistant measure of central tendency, had roots in 18th-century probability theory, with precursors in the works of Laplace and others who recognized the limitations of the arithmetic mean under non-normal errors, though Carl Friedrich Gauss's 1809 development of least squares emphasized assumptions of equal accuracy that later highlighted the need for robustness.[35] The 20th century saw a renewed push toward robust methods, particularly through John Tukey's exploratory data analysis in the 1960s, which emphasized graphical techniques and resistant summaries to uncover structure in potentially messy data.[36] Tukey's seminal 1960 paper, "A Survey of Sampling from Contaminated Distributions," formalized the concept of contamination models and inspired systematic investigations into robustness, marking a shift from assuming perfect data to designing procedures resilient to deviations. Key figures shaped the theoretical foundations in the mid-20th century. Peter J. Huber introduced M-estimators in 1964 as a generalization of maximum likelihood for handling gross errors, providing a minimax framework for location estimation under contamination. Frank R. Hampel advanced this in 1974 with the influence function, a tool to quantify the impact of individual observations on estimators and guide the design of qualitatively robust procedures. In the 1980s, Peter J. 
Rousseeuw developed high-breakdown-point methods, such as least median of squares regression in 1984 and minimum covariance determinant in 1985, which could withstand up to nearly 50% contamination without total failure. Milestones in the 1970s included dedicated academic gatherings, such as the 1970 Princeton seminar on robust statistics, which brought together leading researchers to discuss stability under model misspecification.[10] The decade also saw the emergence of robustness as a core theme in statistical conferences, culminating in events like the 1979 Boston symposium that disseminated early results. By the 1990s, practical implementation advanced with software integration; S-PLUS introduced robust libraries featuring M-estimators and regression tools, enabling widespread application in statistical computing environments.[37] In the modern era from the 2000s to 2025, robust statistics has integrated with big data challenges, developing scalable estimators for high-dimensional settings, such as median-of-means for sub-Gaussian robustness without strong moment assumptions.[38] Bayesian robustness gained traction, with partial prior specifications and sensitivity analyses addressing model uncertainty in complex posteriors, as explored in foundational overviews from the mid-2000s.[39] Post-2020, the field extended to deep learning, with seminal works on adversarial robustness using certified defenses and robust optimization to mitigate vulnerabilities in neural networks, exemplified by efficient training methods achieving state-of-the-art protection against white-box attacks.[40] Computational advances post-2010, including polynomial-time algorithms for high-breakdown covariance estimation, have been supported by open-source tools; in Python, libraries like statsmodels.robust (introduced around 2012) and the more recent RobPy package provide accessible implementations for robust regression and multivariate analysis.

References
