Clustered standard errors
Clustered standard errors (or Liang–Zeger standard errors)[1] are measurements that estimate the standard error of a regression parameter in settings where observations may be subdivided into smaller-sized groups ("clusters") and where the sampling and/or treatment assignment is correlated within each group.[2][3] Clustered standard errors are widely used in a variety of applied econometric settings, including difference-in-differences[4] or experiments.[5]
Analogous to how Huber-White standard errors are consistent in the presence of heteroscedasticity and Newey–West standard errors are consistent in the presence of accurately-modeled autocorrelation, clustered standard errors are consistent in the presence of cluster-based sampling or treatment assignment. Clustered standard errors are often justified by possible correlation in modeling residuals within each cluster; while recent work suggests that this is not the precise justification behind clustering,[6] it may be pedagogically useful.
Intuitive motivation
Clustered standard errors are often useful when treatment is assigned at the level of a cluster instead of at the individual level. For example, suppose that an educational researcher wants to discover whether a new teaching technique improves student test scores. She therefore assigns teachers in "treated" classrooms to try this new technique, while leaving "control" classrooms unaffected. When analyzing her results, she may want to keep the data at the student level (for example, to control for student-level observable characteristics). However, when estimating the standard error or confidence interval of her statistical model, she realizes that classical or even heteroscedasticity-robust standard errors are inappropriate because student test scores within each class are not independently distributed. Instead, students in classes with better teachers have especially high test scores (regardless of whether they receive the experimental treatment) while students in classes with worse teachers have especially low test scores. The researcher can cluster her standard errors at the level of a classroom to account for this aspect of her experiment.[7]
While this example is very specific, similar issues arise in a wide variety of settings. For example, in many panel data settings (such as difference-in-differences) clustering often offers a simple and effective way to account for non-independence between periods within each unit (sometimes referred to as "autocorrelation in residuals").[4] Another common and logically distinct justification for clustering arises when a full population cannot be randomly sampled, and so instead clusters are sampled and then units are randomized within cluster. In this case, clustered standard errors account for the uncertainty driven by the fact that the researcher does not observe large parts of the population of interest.[8]
Mathematical motivation
A useful mathematical illustration comes from the case of one-way clustering in an ordinary least squares (OLS) model. Consider a simple model with $N$ observations that are subdivided into $C$ clusters. Let $Y$ be an $N \times 1$ vector of outcomes, $X$ an $N \times K$ matrix of covariates, $\beta$ a $K \times 1$ vector of unknown parameters, and $e$ an $N \times 1$ vector of unexplained residuals:

$$Y = X\beta + e$$
As is standard with OLS models, we minimize the sum of squared residuals to get an estimate $\hat{\beta}$:

$$\hat{\beta} = \arg\min_{\beta}\,(Y - X\beta)'(Y - X\beta) = (X'X)^{-1}X'Y$$
From there, we can derive the classic "sandwich" estimator:

$$V(\hat{\beta}) = V\big((X'X)^{-1}X'e\big) = (X'X)^{-1}X'\,E[ee']\,X(X'X)^{-1}$$
Denoting $\Omega \equiv E[ee']$ yields a potentially more familiar form:

$$V(\hat{\beta}) = (X'X)^{-1}X'\Omega X(X'X)^{-1}$$
While one can develop a plug-in estimator by defining $\hat{e} \equiv Y - X\hat{\beta}$ and letting $\hat{\Omega} \equiv \hat{e}\hat{e}'$, this completely flexible estimator will not converge to $V(\hat{\beta})$ as $N \to \infty$. Given the assumptions that a practitioner deems reasonable, different types of standard errors solve this problem in different ways. For example, classic homoskedastic standard errors assume that $\Omega$ is diagonal with identical elements $\sigma^2$, which simplifies the expression for $V(\hat{\beta})$ to $\sigma^2(X'X)^{-1}$. Huber-White standard errors assume that $\Omega$ is diagonal but that the diagonal values vary, while other types of standard errors (e.g. Newey–West, Moulton SEs, Conley spatial SEs) make other restrictions on the form of this matrix to reduce the number of parameters that the practitioner needs to estimate.
Clustered standard errors assume that $\Omega$ is block-diagonal according to the clusters in the sample, with unrestricted values in each block but zeros elsewhere. In this case, one can define $X_c$ and $e_c$ as the within-block analogues of $X$ and $e$ and derive the following mathematical fact:

$$V(\hat{\beta}) = (X'X)^{-1}\left(\sum_{c} X_c'\,\Omega_c\,X_c\right)(X'X)^{-1}$$

where $\Omega_c \equiv E[e_c e_c']$ is the covariance block for cluster $c$.
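The contrast between these covariance structures can be made concrete for a toy sample. The sketch below (in Python with NumPy, with purely illustrative values for the variances and the within-cluster correlation) constructs the homoskedastic, heteroskedastic, and block-diagonal clustered forms of $\Omega$ for six observations in three clusters:

```python
import numpy as np

# Toy setting: N = 6 observations in C = 3 clusters of size 2.
cluster_ids = np.array([0, 0, 1, 1, 2, 2])
sigma2 = 1.5                                       # common error variance
var_i = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])   # observation-specific variances
rho = 0.8                                          # within-cluster correlation

N = len(cluster_ids)

# Homoskedastic: Omega = sigma^2 * I (diagonal with identical elements).
omega_homo = sigma2 * np.eye(N)

# Heteroskedastic (Huber-White): diagonal, but the diagonal values vary.
omega_hetero = np.diag(var_i)

# Clustered: block-diagonal, unrestricted within blocks, zeros elsewhere.
same_cluster = cluster_ids[:, None] == cluster_ids[None, :]
omega_cluster = np.where(same_cluster, rho, 0.0)
np.fill_diagonal(omega_cluster, 1.0)

# Off-block entries are exactly zero; within-block entries are not.
assert omega_cluster[0, 2] == 0.0 and omega_cluster[0, 1] == rho
```

Each matrix is a special case of the general $\Omega$; the clustered form is the least restrictive of the three, which is why it requires more clusters (more "independent" blocks) for the plug-in estimator to behave well.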
By constructing plug-in matrices $\hat{\Omega}_c \equiv \hat{e}_c\hat{e}_c'$, one can form an estimator for $V(\hat{\beta})$ that is consistent as the number of clusters $C$ becomes large. While no specific number of clusters is statistically proven to be sufficient, practitioners often cite a number in the range of 30–50 and are comfortable using clustered standard errors when the number of clusters exceeds that threshold.
Alternatively, finite-sample corrections are typically applied to reduce the downward bias due to finite $C$.[9] Practitioners often use the following bias-corrected estimator:

$$V_{\text{cluster}}(\hat{\beta}) = \frac{C}{C-1}\,\frac{N-1}{N-K}\,(X'X)^{-1}\left(\sum_{c} X_c'\,\hat{e}_c\hat{e}_c'\,X_c\right)(X'X)^{-1}$$
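The plug-in estimator with the finite-sample factor $\frac{C}{C-1}\frac{N-1}{N-K}$ can be computed directly. The following is a sketch in Python with NumPy on a small synthetic dataset (the data-generating process is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: C = 20 clusters of 5 observations each.
C, n_c = 20, 5
N = C * n_c
cluster = np.repeat(np.arange(C), n_c)
x = rng.normal(size=N)
# Errors share a cluster-level shock, so Omega is block-diagonal.
e = rng.normal(size=C)[cluster] + rng.normal(size=N)
y = 1.0 + 2.0 * x + e

X = np.column_stack([np.ones(N), x])      # N x K design matrix
K = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y              # OLS estimate
resid = y - X @ beta_hat

# Meat: sum over clusters of (X_c' e_c)(X_c' e_c)'.
meat = np.zeros((K, K))
for c in range(C):
    idx = cluster == c
    s_c = X[idx].T @ resid[idx]           # K-vector X_c' e_c
    meat += np.outer(s_c, s_c)

# Bias-corrected sandwich: (C/(C-1)) ((N-1)/(N-K)) bread * meat * bread.
correction = (C / (C - 1)) * ((N - 1) / (N - K))
V_cluster = correction * XtX_inv @ meat @ XtX_inv
se_cluster = np.sqrt(np.diag(V_cluster))
```

Here `se_cluster[1]` is the clustered standard error on the slope; dropping the `correction` factor recovers the uncorrected plug-in estimator.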
However, more recent practice has shifted towards analogues of the heteroscedasticity-robust HC2 and HC3 estimators.[9] Often called the CR2 and CR3 estimators, these are unbiased under certain assumptions. They have also been shown, especially when combined with degrees-of-freedom corrections for use in building confidence intervals, to produce better coverage rates when the number of clusters is not large.[10][11]
Further reading
[edit]- Alberto Abadie, Susan Athey, Guido W Imbens, and Jeffrey M Wooldridge. 2022. "When Should You Adjust Standard Errors for Clustering?" Quarterly Journal of Economics.
References
[edit]- ^ Liang, Kung-Yee; Zeger, Scott L. (1986-04-01). "Longitudinal data analysis using generalized linear models". Biometrika. 73 (1): 13–22. doi:10.1093/biomet/73.1.13. ISSN 0006-3444.
- ^ Cameron, A. Colin; Miller, Douglas L. (2015-03-31). "A Practitioner's Guide to Cluster-Robust Inference". Journal of Human Resources. 50 (2): 317–372. CiteSeerX 10.1.1.703.724. doi:10.3368/jhr.50.2.317. ISSN 0022-166X. S2CID 1296789.
- ^ "ARE 212". Fiona Burlig. Retrieved 2020-07-05.
- ^ a b Bertrand, Marianne; Duflo, Esther; Mullainathan, Sendhil (2004-02-01). "How Much Should We Trust Differences-In-Differences Estimates?". The Quarterly Journal of Economics. 119 (1): 249–275. doi:10.1162/003355304772839588. hdl:1721.1/63690. ISSN 0033-5533. S2CID 470667.
- ^ Yixin Tang (2019-09-11). "Analyzing Switchback Experiments by Cluster Robust Standard Error to prevent false positive results". DoorDash Engineering Blog. Retrieved 2020-07-05.
- ^ Abadie, Alberto; Athey, Susan; Imbens, Guido; Wooldridge, Jeffrey (2017-10-24). "When Should You Adjust Standard Errors for Clustering?". arXiv:1710.02926 [math.ST].
- ^ "CLUSTERED STANDARD ERRORS". Economic Theory Blog. 2016. Archived from the original on 2016-11-06. Retrieved 28 September 2021.
- ^ "When should you cluster standard errors? New wisdom from the econometrics oracle". blogs.worldbank.org. Retrieved 2020-07-05.
- ^ a b "A Practitioner's Guide to Cluster-Robust Inference" (PDF). UC Davis - Economics. Retrieved 2024-07-04.
- ^ Bell, Robert M; McCaffrey, Daniel F (December 2002). "Bias Reduction in Standard Errors for Linear Regression with Multi-Stage Samples" (PDF). Survey Methodology. 28 (2): 169–181.
- ^ Imbens, Guido W; Kolesár, Michal (October 2012). "Robust Standard Errors in Small Samples: Some Practical Advice". NBER Working Paper No. w18478.
Clustered standard errors
Clustered standard errors are implemented in most statistical packages, for example via Stata's vce(cluster clustvar) option or R's sandwich package.[2]
Background on Regression Standard Errors
Ordinary Least Squares Standard Errors
In ordinary least squares (OLS) regression, standard errors quantify the precision of the coefficient estimates by measuring the variability around the point estimates of the regression parameters.[4] These estimates arise from minimizing the sum of squared residuals in the linear model $y = X\beta + \varepsilon$, where $y$ is the response vector, $X$ is the design matrix, $\beta$ is the parameter vector, and $\varepsilon$ is the error term.[5] The classical linear regression model underpinning OLS relies on key assumptions about the error term: homoskedasticity, meaning the errors have constant variance across all levels of the independent variables, and independence, implying no correlation between errors for different observations. These assumptions ensure that the OLS estimator is unbiased and has minimum variance among linear unbiased estimators, as established by the Gauss-Markov theorem.[6] Under these assumptions, the variance-covariance matrix of the OLS estimator is given by

$$\operatorname{Var}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1},$$

where $\hat{\sigma}^2$ denotes the error variance, estimated from the residuals as $\hat{\sigma}^2 = \hat{\varepsilon}'\hat{\varepsilon}/(n - k)$, with $n$ observations and $k$ parameters.[4] The individual standard errors are then obtained as the square roots of the diagonal elements of this matrix, providing the basis for inference such as t-tests on the coefficients.[5] This framework traces its origins to the Gauss-Markov theorem, first articulated by Carl Friedrich Gauss in the 1820s, which demonstrated the optimality of least squares under the specified assumptions, later formalized in modern econometrics.[7]

Violations of Independence in Data
In ordinary least squares (OLS) regression, the standard errors assume that observations are independent and identically distributed, meaning the error terms for different observations are uncorrelated. However, real-world data often violate this independence assumption due to inherent dependencies among observations.[1] Common forms of dependence include spatial correlation, where observations in geographic proximity exhibit correlated errors, such as property values in neighboring areas influenced by shared local factors like zoning policies. Temporal correlation arises in time series data, where past errors predict future ones, as seen in economic indicators like GDP growth that carry over shocks from previous periods. Group-level correlation occurs when units within the same cluster, such as employees' wages within a firm or students in a classroom, share unobserved shocks, like industry-wide policy changes affecting firm-level decisions.[8][9][10]

Ignoring these dependencies leads to severe inferential problems. Standard errors are typically underestimated, resulting in inflated t-statistics and a higher likelihood of falsely rejecting null hypotheses, which can produce spurious evidence of statistical significance. For instance, in simulations with clustered data, failure to account for within-group correlation can increase Type I error rates to nearly 50%, compared to the nominal 5%.[1][11]

Data structures particularly prone to such violations include panel data, which track the same units over time and often feature both temporal and group dependencies; clustered sampling designs, where sampling occurs in groups like households or firms, inducing intra-cluster correlation; and experimental setups with grouped treatments, such as randomized trials at the school level where student outcomes correlate within schools.
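The understatement of naive standard errors under within-cluster correlation can be illustrated with a small simulation, sketched here in Python with NumPy (the data-generating process, with a large shared cluster shock and treatment assigned at the cluster level, is an assumption chosen to make the effect pronounced):

```python
import numpy as np

rng = np.random.default_rng(42)

# 30 clusters of 10 observations; treatment assigned at the cluster level.
C, n_c = 30, 10
N = C * n_c
cluster = np.repeat(np.arange(C), n_c)
treat = (np.arange(C) % 2)[cluster].astype(float)   # half the clusters treated

# Errors = large shared cluster shock + small idiosyncratic noise.
e = 2.0 * rng.normal(size=C)[cluster] + 0.5 * rng.normal(size=N)
y = 0.0 * treat + e                                  # true treatment effect is zero

X = np.column_stack([np.ones(N), treat])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Naive (homoskedastic) variance: sigma^2 (X'X)^{-1}.
sigma2 = resid @ resid / (N - 2)
se_naive = np.sqrt(np.diag(sigma2 * XtX_inv))

# Cluster-robust sandwich variance.
meat = np.zeros((2, 2))
for c in range(C):
    idx = cluster == c
    s = X[idx].T @ resid[idx]
    meat += np.outer(s, s)
V_cl = XtX_inv @ meat @ XtX_inv
se_cluster = np.sqrt(np.diag(V_cl))

# With a strong cluster shock and cluster-level treatment, se_cluster[1]
# is typically several times larger than se_naive[1].
```

Comparing `se_naive[1]` and `se_cluster[1]` shows how much precision the naive formula overstates in this design; t-statistics built on the naive standard error would reject the (true) null far more often than the nominal rate.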
Empirical evidence from labor economics underscores these issues: in analyses of firm-clustered wage data, unadjusted OLS standard errors can be biased downward by factors of 3 to 10, as demonstrated in studies using matched employer-employee datasets, leading to overstated precision in estimates of labor market effects.[1][12][13]

Motivation for Clustering
Intuitive Rationale
In regression analysis, observations are often not fully independent due to grouping structures in the data, such as shared environmental or social factors that influence outcomes similarly within groups. An intuitive analogy for this is comparing independent observations to individual apples picked randomly from various trees, where each apple's quality varies independently; in contrast, clustered observations resemble apples harvested from the same orchard, where a weather shock like a late frost affects all apples in that group similarly, leading to correlated qualities across them but independence between orchards. This grouping violates the assumption of independent errors in standard regression models, as unmeasured factors (e.g., regional climate) create within-group correlations.[14] Real-world data frequently exhibits such clustering; for instance, in educational surveys, student test scores within the same classroom may correlate due to shared unobservables like teacher quality or classroom resources, while scores across different classrooms are more independent. Similarly, household survey responses on income or health can cluster by family unit, where siblings' outcomes are influenced by common factors such as parental education or neighborhood effects. Ignoring these dependencies treats all observations as equally informative, which overstates the precision of estimates.[1] Clustered standard errors address this by adjusting for within-cluster similarities, effectively treating each cluster as a single unit for variance estimation rather than each observation independently; this widens the standard errors to reflect the true uncertainty, preventing overconfidence in statistical significance and reducing the risk of false positives in hypothesis tests. 
In the student test score example, suppose a regression estimates the effect of study hours on exam performance across 50 students in 5 classrooms; without clustering, the standard error on the coefficient might be unrealistically narrow, yielding a tight 95% confidence interval and a misleadingly low p-value, suggesting strong evidence of an effect. With clustering by classroom, the standard error inflates, producing a wider interval and a higher p-value, better capturing the reduced information from correlated errors within groups.[15][1]

Statistical Foundations
Clustered standard errors arise within the framework of robust inference for regression models where observations are not independent, particularly when data exhibit dependence within predefined groups or clusters. The sandwich estimator, originally developed for heteroskedasticity-consistent covariance matrices, forms the core of this approach by providing a general structure for estimating the variance-covariance matrix of parameter estimates that remains consistent under violations of standard assumptions.[16] Clustering matters statistically because unobserved correlations within clusters lead to an understatement of the true variability in parameter estimates if conventional standard errors are used; this intra-cluster dependence effectively reduces the number of independent observations, increasing the effective variance and necessitating adjustments to the covariance matrix. The cluster-robust version of the sandwich estimator imposes a block-diagonal structure on the middle component of the covariance matrix, assuming independence across clusters while allowing arbitrary dependence within them, which captures this group-level correlation without requiring a fully specified dependence model.[17][1] These corrected standard errors are essential for valid hypothesis testing, as they ensure that t-tests, F-tests, and confidence intervals maintain their nominal coverage probabilities under group dependence, preventing inflated Type I error rates that would otherwise occur with naive standard errors. Extensions of the Central Limit Theorem for clustered data underpin this validity, justifying asymptotic normality of the estimators when the number of clusters grows large, even with fixed or growing cluster sizes. This theoretical foundation, formalized in early work on cluster-robust inference, supports the widespread use of these methods in empirical research.[17][1]

Mathematical Formulation
Model Assumptions and Clustering
The standard linear regression model for clustered data is given by $y = X\beta + \varepsilon$, where $y$ is an $n \times 1$ vector of outcomes, $X$ is an $n \times k$ matrix of regressors, $\beta$ is a $k \times 1$ vector of parameters, and $\varepsilon$ is an $n \times 1$ vector of errors, with the assumption that $E[\varepsilon \mid X] = 0$.[1] Unlike the classical model, the variance-covariance matrix $\Omega = E[\varepsilon\varepsilon' \mid X]$ is permitted to exhibit a block-diagonal structure corresponding to clusters, allowing for arbitrary correlations within clusters while assuming independence across them. This setup relaxes the classical assumption of full independence across all observations, replacing it with the condition that errors are independent between clusters but may be arbitrarily correlated within each cluster.[1] Specifically, for observations $i$ and $j$ in different clusters, $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$, whereas within-cluster covariances can take any form, including heteroskedasticity or serial correlation. Clusters are defined as groups indexed by $g = 1$ to $G$, each containing $n_g$ observations, where the total sample size is $n = \sum_{g=1}^{G} n_g$, and clusters often represent natural groupings such as firms, schools, or geographic units.[1] The ordinary least squares (OLS) estimator $\hat{\beta} = (X'X)^{-1}X'y$ remains unbiased and consistent under the maintained exogeneity assumption $E[\varepsilon \mid X] = 0$, but its efficiency is generally lower due to within-cluster dependence, necessitating adjusted standard errors for valid inference. Without such adjustments, conventional standard errors underestimate the true variability, leading to overly narrow confidence intervals and inflated test statistics.[1]

Variance-Covariance Matrix Derivation
The cluster-robust variance-covariance matrix for the ordinary least squares (OLS) estimator in a linear regression model accounts for potential correlation of errors within predefined clusters while assuming independence across clusters. This estimator takes the form of a sandwich covariance matrix,

$$\hat{V}_{\text{CR}}(\hat{\beta}) = (X'X)^{-1}\,\hat{B}\,(X'X)^{-1},$$

where $X$ is the design matrix and the "meat" is constructed as

$$\hat{B} = \sum_{g=1}^{G} X_g'\,\hat{\varepsilon}_g\hat{\varepsilon}_g'\,X_g.$$

Here, the data are partitioned into $G$ clusters, with $X_g$ and $\hat{\varepsilon}_g$ denoting the submatrices of regressors and OLS residuals corresponding to cluster $g$, respectively. This form extends the heteroskedasticity-consistent estimator of White (1980) to allow for intra-cluster dependence, as originally proposed by Liang and Zeger (1986) for generalized estimating equations and adapted to OLS in econometric applications.

To derive this estimator, begin with the asymptotic variance of $\hat{\beta}$. Under standard OLS assumptions except for error independence (specifically, $E[\varepsilon \mid X] = 0$ and $\operatorname{Var}(\varepsilon \mid X) = \Omega$, where $\Omega$ is block-diagonal with cluster-specific blocks $\Omega_g$), the variance is

$$\operatorname{Var}(\hat{\beta}) = (X'X)^{-1}\left(\sum_{g=1}^{G} X_g'\,\Omega_g\,X_g\right)(X'X)^{-1}.$$

The influence of each cluster on the estimator aggregates the observation-level contributions: observation $i$ in cluster $g$ contributes $x_i\varepsilon_i$, so the cluster-level score is $s_g = X_g'\varepsilon_g$, and the middle matrix captures the variance of these cluster scores, assuming independence across $g$. The cluster-robust estimator replaces the expectations with sample analogs using OLS residuals $\hat{\varepsilon}_g$, yielding the plug-in meat $\hat{B}$ above, which is consistent under the stated assumptions.[1]

For finite samples, particularly when the number of clusters $G$ is small, the conventional cluster-robust estimator can understate the variance due to downward bias, leading to overstated precision. Adjustments mitigate this; for example, a simple degrees-of-freedom correction multiplies the estimator by $\frac{G}{G-1}\frac{n-1}{n-k}$, and more sophisticated versions like the CR2 estimator of Bell and McCaffrey (2002) scale the residuals upward using $(I - H_{gg})^{-1/2}$, where $H_{gg} = X_g(X'X)^{-1}X_g'$ is the within-cluster projection matrix.
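The CR2 adjustment of Bell and McCaffrey (2002), which rescales each cluster's residuals by $(I - H_{gg})^{-1/2}$, can be sketched as follows in Python with NumPy (the inverse symmetric square root is computed by eigendecomposition, and the toy data are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Small sample: G = 8 clusters of 4 observations, where small-G bias matters.
G, n_g = 8, 4
N = G * n_g
cluster = np.repeat(np.arange(G), n_g)
x = rng.normal(size=N)
y = 1.0 + 0.5 * x + rng.normal(size=G)[cluster] + rng.normal(size=N)

X = np.column_stack([np.ones(N), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

def inv_sqrt_sym(M):
    """Inverse symmetric square root via eigendecomposition."""
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

meat = np.zeros((2, 2))
for g in range(G):
    idx = cluster == g
    X_g, e_g = X[idx], resid[idx]
    H_gg = X_g @ XtX_inv @ X_g.T              # within-cluster projection block
    A_g = inv_sqrt_sym(np.eye(n_g) - H_gg)    # CR2 residual adjustment
    s_g = X_g.T @ (A_g @ e_g)                 # adjusted cluster score
    meat += np.outer(s_g, s_g)

V_cr2 = XtX_inv @ meat @ XtX_inv
se_cr2 = np.sqrt(np.diag(V_cr2))
```

Because the eigenvalues of $I - H_{gg}$ lie below one, the adjustment inflates the residuals, counteracting the mechanical shrinkage of OLS residuals that biases the uncorrected estimator downward.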
Alternatively, $G - 1$ degrees of freedom can be used for t-tests in place of $n - k$ when $G$ is small. Asymptotically, the cluster-robust estimator is consistent if the number of clusters $G \to \infty$, regardless of whether individual cluster sizes grow, provided the usual regularity conditions hold (e.g., bounded moments and no dominant clusters). Consistency also obtains under fixed $G$ with $n_g \to \infty$ for each $g$, as the intra-cluster dependence is consistently estimated within blocks, though the $G \to \infty$ case is more standard in econometric applications with many small clusters. These properties ensure valid inference even when errors are arbitrarily correlated within clusters.[1]

Estimation and Implementation
Defining Clusters
In clustered standard errors, clusters are defined based on the underlying source of correlation among observations, typically arising from shared unobservables such as geographic proximity, temporal dependencies, or organizational structures that induce within-group error correlations.[18][1] For instance, observations may be clustered by geography (e.g., states or villages) when policy interventions affect entire regions similarly, by time periods (e.g., years) when macroeconomic shocks influence multiple units contemporaneously, or by organizations (e.g., firms or households) when internal factors create persistent dependencies.[18][19] Common cluster levels vary by field and data structure. In corporate finance, firm-level clustering is standard to account for time-series correlations within firms due to unobserved firm-specific effects.[19] In policy analysis, state-level clustering often captures geographic spillovers or common regulatory environments.[1] Multi-level clustering is appropriate for hierarchical data, such as students nested within classrooms and classrooms within schools, where errors correlate at multiple tiers due to shared instructional or institutional factors.[1] Practitioners should follow rules of thumb to ensure reliable inference. A minimum of 50 clusters is generally recommended to approximate asymptotic properties and avoid severe downward bias in standard errors, though fewer may suffice in balanced designs with low intra-cluster correlation.[1] Over-clustering, such as defining clusters at highly aggregated levels with too few effective groups (e.g., fewer than 20), can lead to conservative estimates and reduced power; instead, cluster at the level where correlations are strongest without unnecessarily coarsening the data.[19][1] In software implementations, cluster variables are specified to compute the variance-covariance matrix accordingly. 
In Stata, the cluster() option in regress or xtreg allows one-way clustering by a variable (e.g., regress y x, cluster(firm_id)), while multi-way options like cgmreg handle multiple dimensions.[1] In R, the sandwich package's vcovCL() function enables clustered variance estimation, as in vcovCL(lm(y ~ x), cluster = ~firm_id).[1] The choice of cluster definition directly influences the estimated variance matrix, underscoring the need for theoretically motivated selection.[18]
Computational Algorithms
The computation of clustered standard errors typically follows the sandwich estimator framework, adapted for clustering. First, an ordinary least squares (OLS) regression is fitted to the data to obtain the parameter estimates $\hat{\beta}$ and residuals $\hat{\varepsilon}_i$ for each observation $i$. Second, cluster-specific contributions to the variance are calculated by computing the score vector $s_g = X_g'\hat{\varepsilon}_g$ for each cluster $g$ and forming its outer product $s_g s_g'$. Third, these outer products are summed across all clusters to form the "meat" matrix $\hat{B} = \sum_{g=1}^{G} s_g s_g'$, which is then sandwiched between two "bread" matrices $(X'X)^{-1}$ to yield the robust variance-covariance matrix $(X'X)^{-1}\hat{B}(X'X)^{-1}$. Finite-sample bias corrections, such as multiplying the meat by $\frac{G}{G-1}$, are often applied to improve small-cluster performance.[1]

Pseudocode for implementing the one-way clustered sandwich estimator in a matrix-oriented language like R might proceed as follows:

# Assume data: y (n x 1), X (n x k, including intercept), cluster_ids (n x 1 vector of cluster labels)
# Step 1: Fit OLS
beta_hat = solve(t(X) %*% X) %*% t(X) %*% y   # or use lm()
epsilon_hat = y - X %*% beta_hat

# Step 2: Compute cluster-specific sums of outer products
unique_clusters = unique(cluster_ids)
G = length(unique_clusters)
k = ncol(X)
meat = matrix(0, k, k)
for (g in unique_clusters) {
  idx_g = which(cluster_ids == g)
  X_g = X[idx_g, , drop = FALSE]              # keep matrix shape for 1-row clusters
  eps_g = epsilon_hat[idx_g]
  score_g = colSums(X_g * eps_g)              # k-vector cluster score X_g' e_g
  meat = meat + tcrossprod(score_g)           # outer product s_g s_g'
}
meat = meat * (G / (G - 1))                   # simple small-G bias correction

# Step 3: Sandwich
bread = solve(t(X) %*% X)
vcov_cl = bread %*% meat %*% bread
se_cl = sqrt(diag(vcov_cl))
For large datasets, efficient implementations avoid materializing any $n \times n$ matrix: they group observations by cluster (e.g., with by or tapply in R) and aggregate the outer products in a single pass, reducing time complexity from the $O(n^2)$ cost of a naive full-covariance approach to near $O(nk^2)$, where $n$ is the number of observations and $k$ the number of parameters. Packages like R's sandwich leverage these techniques, supporting parallelization for bootstrap procedures.[21]
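The single-pass aggregation idea can be illustrated as follows (a sketch in Python with NumPy rather than R; the vectorized version uses np.add.at to accumulate per-cluster scores without an explicit loop over clusters, and the two approaches are checked against each other):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy data: n = 1000 observations, k = 3 regressors, 50 clusters.
n, k, G = 1000, 3, 50
cluster = rng.integers(0, G, size=n)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
resid = y - X @ (XtX_inv @ X.T @ y)

# Loop version: one boolean scan per cluster.
meat_loop = np.zeros((k, k))
for g in range(G):
    s_g = X[cluster == g].T @ resid[cluster == g]
    meat_loop += np.outer(s_g, s_g)

# Single-pass version: accumulate all cluster scores at once, then
# sum their outer products; no per-cluster scans over the data.
scores = np.zeros((G, k))
np.add.at(scores, cluster, X * resid[:, None])   # scores[g] += x_i * e_i
meat_fast = scores.T @ scores

assert np.allclose(meat_loop, meat_fast)
```

The single-pass version touches each observation once regardless of the number of clusters, which is the property that makes cluster-robust variance estimation cheap even on large panels.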
