Generalized estimating equation
from Wikipedia

In statistics, a generalized estimating equation (GEE) is used to estimate the parameters of a generalized linear model with a possible unmeasured correlation between observations from different timepoints.[1][2]

Regression beta coefficient estimates from the Liang–Zeger GEE are consistent, unbiased, and asymptotically normal even when the working correlation is misspecified, under mild regularity conditions. GEE is more efficient than generalized linear models (GLMs) in the presence of high autocorrelation.[1] When the true working correlation is known, consistency does not require the assumption that missing data are missing completely at random.[1] Huber–White standard errors improve the efficiency of Liang–Zeger GEE in the absence of serial autocorrelation but may remove the marginal interpretation. GEE estimates the average response over the population ("population-averaged" effects) with Liang–Zeger standard errors, and in individuals using Huber–White standard errors, also known as "robust standard error" or "sandwich variance" estimates.[3] Huber–White GEE has been in use since 1997, and Liang–Zeger GEE dates to the 1980s, based on a limited literature review.[4] Several independent formulations of these standard error estimators contribute to GEE theory. Placing the independent standard error estimators under the umbrella term "GEE" may exemplify abuse of terminology.

GEEs belong to a class of regression techniques referred to as semiparametric because they rely on specification of only the first two moments. They are a popular alternative to the likelihood-based generalized linear mixed model, which is more sensitive to the specification of the variance structure and can lose consistency when it is misspecified.[5] The trade-off for keeping regression coefficient estimates consistent under variance-structure misspecification is a loss of efficiency: standard errors are larger than those of the optimal estimator, which inflates Wald test p-values.[6] GEEs are commonly used in large epidemiological studies, especially multi-site cohort studies, because they can handle many types of unmeasured dependence between outcomes.

Formulation


Given a mean model $\mu_{ij}$ for subject $i$ and time $j$ that depends upon regression parameters $\beta$, and a variance structure $V_i$, the estimating equation is formed via:[7]

$$U(\beta) = \sum_{i=1}^N \frac{\partial \mu_i}{\partial \beta} V_i^{-1} \{ Y_i - \mu_i(\beta) \}$$

The parameters $\beta$ are estimated by solving $U(\beta) = 0$, typically via the Newton–Raphson algorithm. The variance structure is chosen to improve the efficiency of the parameter estimates. The Hessian of the solution to the GEEs in the parameter space can be used to calculate robust standard error estimates. The term "variance structure" refers to the algebraic form of the covariance matrix between outcomes, Y, in the sample. Examples of variance structure specifications include independence, exchangeable, autoregressive, stationary m-dependent, and unstructured. The most popular form of inference on GEE regression parameters is the Wald test using naive or robust standard errors, though the score test is also valid and preferable when it is difficult to obtain estimates of information under the alternative hypothesis. The likelihood ratio test is not valid in this setting because the estimating equations are not necessarily likelihood equations. Model selection can be performed with the GEE equivalent of the Akaike information criterion (AIC), the quasi-likelihood under the independence model criterion (QIC).[8]
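The pieces just described (mean model, working variance structure, Newton–Raphson solving, QIC-based model selection) can be tried end to end in the statsmodels package listed under Computation below. The following is a minimal sketch on simulated data; the simulation and all variable names are illustrative, not drawn from any referenced study.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_subjects, n_times = 200, 4

# Clustered binary data: a shared subject-level effect induces
# within-subject correlation in the outcomes.
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subjects), n_times),
    "time": np.tile(np.arange(n_times), n_subjects),
    "x": rng.normal(size=n_subjects * n_times),
})
subject_effect = np.repeat(rng.normal(scale=0.8, size=n_subjects), n_times)
prob = 1.0 / (1.0 + np.exp(-(0.5 * df["x"] + subject_effect)))
df["y"] = rng.binomial(1, prob)

# GEE with a logit link and an exchangeable working correlation;
# robust (sandwich) standard errors are reported by default.
model = sm.GEE.from_formula("y ~ x + time", groups="subject", data=df,
                            family=sm.families.Binomial(),
                            cov_struct=sm.cov_struct.Exchangeable())
result = model.fit()
print(result.summary())
print(result.qic())  # QIC-type criteria for model/structure selection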

Relationship with Generalized Method of Moments


The generalized estimating equation is a special case of the generalized method of moments (GMM).[9] This relationship is immediately obvious from the requirement that the score function satisfy the equation:

$$E\left[ U(\beta) \right] = \sum_{i=1}^N E\!\left[ \frac{\partial \mu_i}{\partial \beta} V_i^{-1} \{ Y_i - \mu_i(\beta) \} \right] = 0,$$

which is precisely a population moment condition of the kind that GMM exploits.

Computation


Software for solving generalized estimating equations is available in MATLAB,[10] SAS (proc genmod[11]), SPSS (the gee procedure[12]), Stata (the xtgee command[13]), R (packages glmtoolbox,[14] gee,[15] geepack[16] and multgee[17]), Julia (package GEE.jl[18]) and Python (package statsmodels[19]).

Comparisons among software packages for the analysis of binary correlated data [20][21] and ordinal correlated data[22] via GEE are available.
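As a small illustration of what such comparisons involve, the sketch below refits one marginal model under different working correlation structures in statsmodels and prints the resulting coefficients and standard errors side by side; the data frame df is assumed to be the simulated clustered data from the sketch in the Formulation section.

import statsmodels.api as sm

# Point estimates should agree closely across working structures;
# efficiency (and hence the standard errors) may differ.
for cov in (sm.cov_struct.Independence(), sm.cov_struct.Exchangeable()):
    res = sm.GEE.from_formula("y ~ x + time", groups="subject", data=df,
                              family=sm.families.Binomial(),
                              cov_struct=cov).fit()
    print(type(cov).__name__, res.params.round(3).tolist(),
          res.bse.round(3).tolist())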

from Grokipedia
Generalized estimating equations (GEE) are a class of estimating equations that extend generalized linear models to the analysis of correlated data, such as longitudinal or clustered observations, by providing consistent estimates of regression coefficients and their variances while treating the correlation structure as a nuisance parameter, without needing to specify the full joint distribution of the data. Originally proposed by Kung-Yee Liang and Scott L. Zeger in their 1986 paper "Longitudinal Data Analysis Using Generalized Linear Models," GEE address the challenges of dependent observations by focusing on marginal means and using a working correlation matrix to improve efficiency over independence working assumptions. The method yields asymptotically normal estimates that remain consistent even if the specified correlation structure is misspecified, provided the mean model is correctly formulated, and it employs robust "sandwich" variance estimators so that inference remains valid under such misspecification. GEE are particularly valued in biomedical and epidemiological research for handling repeated measures data, such as patient outcomes over time or clustered trial data, where traditional independence models fail to capture intra-subject dependencies. Key features include support for various correlation structures (independence, exchangeable, autoregressive, and m-dependence, among others) and applicability to non-Gaussian outcomes via appropriate link functions, reducing to generalized least squares for multivariate Gaussian data. Advantages encompass ease of interpretation compared to mixed-effects models for population-averaged effects and robustness to correlation misspecification, though limitations involve potential inefficiency in estimating correlation parameters and challenges with small sample sizes or informative cluster sizes. Further developments include correlation structure selection using criteria like the quasi-likelihood under the independence model criterion (QIC) and corrected QIC (CIC), sample size and power calculations tailored for correlated designs, extensions to handle informative cluster sizes through weighted or modified GEE approaches, grouped GEE for longitudinal analysis (as of 2022), and applications to nonlinear repeated measures (as of 2024). Overall, GEE remain a cornerstone method in the statistical analysis of dependent data, balancing simplicity, robustness, and interpretability across diverse fields including epidemiology, biostatistics, and the social sciences.

Introduction

Definition and motivation

Generalized estimating equations (GEEs) constitute a semi-parametric statistical framework for estimating population-averaged regression parameters from correlated data, extending the generalized linear model paradigm without necessitating a complete specification of the joint distribution of the responses. This approach allows for the analysis of marginal means while accounting for within-cluster dependencies, making it particularly suitable for non-independent observations. The primary motivation for GEEs arises from the limitations of traditional generalized linear models (GLMs), which assume independence among observations and thus yield consistent but inefficient estimates with incorrect (typically understated) standard errors when applied to data exhibiting correlation, such as repeated measures in longitudinal studies or clustered sampling designs. In fields like epidemiology and biomedical research, where data often involve multiple observations per subject, such as repeated clinical readings over time or outcomes from family members, ignoring this dependence can lead to understated variability and overly narrow confidence intervals. GEEs address this by incorporating a working correlation matrix to adjust for intra-cluster correlations, enabling reliable inference on average effects across the population. A core advantage of GEEs lies in their robustness: even if the specified correlation structure is misspecified, the parameter estimates for the marginal means remain consistent, though standard errors may require robust variance adjustments for validity. Unlike fully parametric mixed-effects models, which emphasize subject-specific interpretations by conditioning on random effects to capture individual heterogeneity, GEEs prioritize population-averaged effects that directly inform policy decisions without modeling the conditional distribution.

Historical background

The generalized estimating equation (GEE) approach was first introduced by Kung-Yee Liang and Scott L. Zeger in their seminal 1986 paper, which extended generalized linear models to handle correlated longitudinal data in biostatistics and epidemiology. This development was motivated by the need for robust methods to analyze repeated measures from clinical studies, where observations within subjects are correlated but full specification of the joint distribution is often impractical. The method built on the generalized linear models of Nelder and Wedderburn (1972) and the quasi-likelihood techniques proposed by Wedderburn (1974), adapting them to account for within-subject dependencies without assuming a complete likelihood. Early extensions followed quickly, with Zeger and Liang's companion 1986 work applying GEEs to both discrete and continuous outcomes, and Zeger's 1988 paper further refining applications for count data in correlated settings. These contributions solidified GEEs as a key tool in epidemiological research, emphasizing population-averaged interpretations and robust variance estimation to address model misspecification. By the 1990s, GEEs gained practical traction through integration into statistical software, notably SAS's PROC GENMOD, which implemented the approach starting around 1993 to facilitate analysis of clustered and longitudinal data. Recognition in influential texts, such as Diggle, Liang, and Zeger's 1994 book Analysis of Longitudinal Data, further popularized the method by providing comprehensive guidance on its use alongside random effects models. Subsequent evolution focused on enhancing robustness and scalability, particularly for high-dimensional and large-scale data contexts. Recent advancements, such as Generalized Estimating Equations Boosting (GEEB), introduced in 2024, integrate GEEs with machine learning techniques such as boosting to handle large-scale longitudinal datasets efficiently.

Mathematical foundations

Model assumptions and setup

Generalized estimating equations (GEE) are designed for analyzing clustered or longitudinal data, where observations within the same cluster are correlated but clusters themselves are independent. The data consist of $N$ independent clusters, indexed by $i = 1, \dots, N$, each containing $n_i$ correlated observations. For cluster $i$, the response vector is $\mathbf{Y}_i = (Y_{i1}, \dots, Y_{in_i})^\top$, an $n_i \times 1$ vector of outcomes, while the corresponding covariates form the design matrix $\mathbf{X}_i$, an $n_i \times p$ matrix of predictors. This setup accommodates various clustering mechanisms, such as repeated measures on the same subject over time or outcomes nested within groups like patients in hospitals.

The responses $Y_{ij}$ (for observation $j$ in cluster $i$) are univariate, with marginal means and variances specified according to generalized linear model frameworks, such as those for Gaussian (continuous outcomes), binomial (binary or proportion data), and Poisson (count data) distributions. The marginal mean of each response is $\boldsymbol{\mu}_i = E(\mathbf{Y}_i)$, related to the linear predictor via a link function: $g(\boldsymbol{\mu}_i) = \mathbf{X}_i \boldsymbol{\beta}$, where $\boldsymbol{\beta}$ is the $p \times 1$ vector of regression parameters and $g(\cdot)$ is a monotonic, differentiable link function (e.g., identity for Gaussian, logit for binomial). This extends the GLM framework to correlated data without modeling the full joint distribution. Covariates in GEE models are treated as fixed effects only, with no random effects incorporated; the approach emphasizes marginal (population-averaged) modeling of the mean response as a function of predictors, which may be time-varying or cluster-specific. This fixed-effects focus allows estimation of average effects across the population, suitable for clustered designs where interest lies in overall associations rather than subject-specific variability.

Key assumptions underpin the GEE setup: the mean function $E(\mathbf{Y}_i) = \boldsymbol{\mu}_i(\boldsymbol{\beta})$ must be correctly specified, ensuring the linear predictor accurately captures the marginal expectations; observations are independent across clusters but correlated within; and the within-cluster correlation structure can be misspecified without invalidating the consistency of the $\boldsymbol{\beta}$ estimates, provided a working correlation matrix is used. Additionally, for unbiasedness, any missing data should be missing completely at random. These assumptions enable robust inference even under correlation misspecification, prioritizing correct mean modeling over precise correlation estimation.
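As a concrete instance of this setup (a standard textbook case, not tied to any particular study), consider binary repeated measures with a logit link: for observation $j$ in cluster $i$,

$$g(\mu_{ij}) = \operatorname{logit}(\mu_{ij}) = \log\frac{\mu_{ij}}{1-\mu_{ij}} = \mathbf{x}_{ij}^\top \boldsymbol{\beta}, \qquad \operatorname{Var}(Y_{ij}) = \mu_{ij}(1-\mu_{ij}),$$

so the first two marginal moments are fully determined by $\boldsymbol{\beta}$, while the within-cluster correlation is left to the working structure.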

Quasi-likelihood framework

The quasi-likelihood approach offers a semi-parametric method for estimating parameters in models where the full probability distribution of the response variable is unspecified, relying solely on specifications for the mean and variance functions. Originally proposed by Wedderburn in 1974, it approximates the log-likelihood by defining a quasi-log-likelihood function $Q(\mu; y)$ whose derivative with respect to the mean $\mu$ satisfies $\frac{\partial Q}{\partial \mu} = \frac{y - \mu}{V(\mu)}$, where $V(\mu)$ is the variance function. Integrating gives $Q(\mu; y) = \int^{\mu} \frac{y - t}{V(t)}\,dt$ (up to a constant of integration that does not affect estimation), and differentiating with respect to the regression parameters yields quasi-score equations, enabling inference without assuming a complete density form.

This framework extends naturally to correlated data by incorporating a working covariance matrix $V_i(\alpha, \beta)$ for the $i$-th cluster or unit, which adjusts for dependence among observations, while $\alpha$ parameterizes the correlation structure. The resulting estimating equations weight the residuals by the inverse of this covariance, providing a robust adjustment for within-cluster correlations without requiring the joint distribution to be fully known. In the independent case, this reduces to the standard score equations of generalized linear models.

Within generalized estimating equations (GEEs), the quasi-score function serves as the foundational structure for deriving unbiased estimating equations that target the regression parameters $\beta$, even under misspecification of the correlation parameters $\alpha$. This focus ensures consistent estimation of $\beta$ as long as the mean model is correctly specified, regardless of the true underlying dependence. Compared to full maximum likelihood methods, which demand parametric assumptions for both means and covariances, the quasi-score approach in GEEs enhances robustness by treating correlations as a feature specified only through a working model, thereby avoiding bias in $\beta$ estimates when the working correlation is misspecified. This property makes GEEs particularly suitable for complex correlated data where full distributional knowledge is impractical.
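A short worked example shows how little is assumed: taking the count-data variance function $V(\mu) = \mu$ (a standard case from the quasi-likelihood literature) gives

$$Q(\mu; y) = \int^{\mu} \frac{y - t}{t}\,dt = y \log \mu - \mu + \text{const},$$

which coincides with the Poisson log-likelihood up to a constant, even though no Poisson distribution was ever assumed; only the mean-variance relationship matters.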

Formulation of estimating equations

Core estimating equations

The core estimating equations for generalized estimating equations (GEE) are given by

$$\mathbf{U}(\beta) = \sum_{i=1}^K \mathbf{D}_i^T \mathbf{V}_i^{-1} (\mathbf{Y}_i - \boldsymbol{\mu}_i) = \mathbf{0},$$

where $\mathbf{Y}_i$ is the $n_i \times 1$ vector of responses for cluster $i$, $\boldsymbol{\mu}_i = E(\mathbf{Y}_i)$ is the corresponding mean vector, $\mathbf{D}_i = \partial \boldsymbol{\mu}_i / \partial \beta$ is the $n_i \times p$ derivative matrix with respect to the $p \times 1$ parameter vector $\beta$, and $\mathbf{V}_i$ is a working covariance matrix approximating $\operatorname{Cov}(\mathbf{Y}_i)$. Specifically, $\mathbf{V}_i = \mathbf{A}_i^{1/2} \mathbf{R}(\alpha) \mathbf{A}_i^{1/2}$, where $\mathbf{A}_i = \operatorname{diag}\{V(\mu_{i1}), \dots, V(\mu_{in_i})\}$ is the diagonal matrix of marginal variances and $\mathbf{R}(\alpha)$ is a working correlation matrix parameterized by $\alpha$.

These equations arise from a quasi-likelihood framework, in which the estimating function $\mathbf{U}(\beta)$ serves as an unbiased estimating score under correct specification of the mean model. The expectation $E[\mathbf{U}(\beta)] = \sum_{i=1}^K \mathbf{D}_i^T \mathbf{V}_i^{-1} E(\mathbf{Y}_i - \boldsymbol{\mu}_i) = \mathbf{0}$ holds because $E(\mathbf{Y}_i - \boldsymbol{\mu}_i) = \mathbf{0}$, ensuring consistency of the solution $\hat{\beta}$ for the marginal regression parameters $\beta$ regardless of working correlation misspecification. Solving $\mathbf{U}(\beta) = \mathbf{0}$ yields estimates of the average effects of covariates on the marginal mean, focusing on population-averaged inferences. The GEE approach generalizes the score equations of generalized linear models (GLMs) by incorporating a working dependence structure to improve efficiency while maintaining robustness. The term "estimating equations" refers to these population moment conditions, which are solved to obtain parameter estimates without requiring a full likelihood specification.

For inference, the robust variance estimator for $\hat{\beta}$ takes the "sandwich" form

$$\operatorname{Var}(\hat{\beta}) = \left( \sum_{i=1}^K \mathbf{D}_i^T \mathbf{V}_i^{-1} \mathbf{D}_i \right)^{-1} \left( \sum_{i=1}^K \mathbf{D}_i^T \mathbf{V}_i^{-1} \operatorname{Cov}(\mathbf{Y}_i) \mathbf{V}_i^{-1} \mathbf{D}_i \right) \left( \sum_{i=1}^K \mathbf{D}_i^T \mathbf{V}_i^{-1} \mathbf{D}_i \right)^{-1},$$

where the middle term captures the true covariance and is often empirically estimated as $\sum_{i=1}^K \mathbf{D}_i^T \mathbf{V}_i^{-1} (\mathbf{Y}_i - \boldsymbol{\mu}_i)(\mathbf{Y}_i - \boldsymbol{\mu}_i)^T \mathbf{V}_i^{-1} \mathbf{D}_i$, providing valid standard errors even under correlation misspecification. This structure ensures the method's robustness for correlated data analysis.
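The sandwich formula translates almost line by line into code. The sketch below (Python with NumPy; the per-cluster arrays are assumed to be already computed, and all names are illustrative) builds the "bread" and the empirical "meat" terms and combines them.

import numpy as np

def sandwich_cov(D_list, V_list, Y_list, mu_list):
    # Robust (sandwich) covariance of beta-hat from per-cluster pieces:
    #   D_list  : (n_i x p) derivative matrices d mu_i / d beta
    #   V_list  : (n_i x n_i) working covariance matrices
    #   Y_list  : (n_i,) response vectors
    #   mu_list : (n_i,) fitted marginal means
    p = D_list[0].shape[1]
    bread = np.zeros((p, p))  # sum_i D_i' V_i^{-1} D_i
    meat = np.zeros((p, p))   # sum_i D_i' V_i^{-1} r_i r_i' V_i^{-1} D_i
    for D, V, Y, mu in zip(D_list, V_list, Y_list, mu_list):
        Vinv_D = np.linalg.solve(V, D)  # V_i^{-1} D_i without an explicit inverse
        u = Vinv_D.T @ (Y - mu)         # D_i' V_i^{-1} (Y_i - mu_i)
        bread += D.T @ Vinv_D
        meat += np.outer(u, u)          # empirical middle term
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ meat @ bread_inv  # bread^{-1} * meat * bread^{-1}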

Working correlation structures

In the generalized estimating equations (GEE) framework, the working correlation structure specifies the form of the working correlation matrix $R(\alpha)$, which approximates the correlation among repeated measures or clustered observations within subjects. This matrix is parameterized by $\alpha$ and incorporated into the estimating equations to improve the efficiency of regression parameter estimates, though the choice of structure does not affect the consistency of these estimates if the mean model is correctly specified.

Several common working correlation structures are available, each suited to different data characteristics. The independence structure sets $R(\alpha) = I$, the identity matrix, assuming no correlation between observations within a cluster; this simplifies GEE to standard generalized linear models and is useful for robustness checks when correlations are weak or unknown, but it can lead to inefficient estimates if dependencies exist (e.g., relative efficiency drops to about 0.74 for strong correlations of 0.7). The exchangeable (or compound symmetry) structure assumes a constant off-diagonal correlation $\alpha$ for all pairs within a cluster, making it appropriate for data where correlations are roughly uniform, such as in simple clustered designs; it performs well when cluster sizes are equal across subjects but loses some efficiency (around 0.82) with highly variable cluster sizes (e.g., 1 to 8 observations). For time-ordered data, as in longitudinal studies, the autoregressive order 1 (AR(1)) structure models correlations that decay exponentially with time lag, where the correlation between observations at times $t_k$ and $t_l$ is $\alpha^{|k-l|}$; this captures serial dependence effectively and retains high efficiency (e.g., 0.98 for $\alpha = 0.9$) even under varying cluster sizes. The unstructured specification estimates distinct correlations for every unique pair of time points, providing the most flexible approximation without imposing a specific pattern; while it yields the highest efficiency when the true correlations match, it requires estimating many parameters (up to $\frac{1}{2} n (n-1)$ for $n$ time points) and is computationally intensive, limiting its use to small clusters.

Selection of an appropriate working correlation structure depends on the data type and empirical fit, often guided by information criteria such as the quasi-likelihood under the independence model criterion (QIC), which penalizes overly complex structures while rewarding better approximation of the true correlations. Misspecification of $R(\alpha)$ reduces the efficiency of the $\beta$ estimates but maintains their consistency and the robustness of the sandwich variance estimator.
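Each of these structures is easy to write down explicitly. A minimal sketch (Python/NumPy; function names and the numeric values are illustrative) constructs the exchangeable and AR(1) working correlation matrices for a cluster of size n:

import numpy as np

def exchangeable_corr(n, alpha):
    # 1 on the diagonal, constant correlation alpha everywhere else.
    return (1 - alpha) * np.eye(n) + alpha * np.ones((n, n))

def ar1_corr(n, alpha):
    # Correlation alpha**|k - l| between occasions k and l.
    idx = np.arange(n)
    return alpha ** np.abs(idx[:, None] - idx[None, :])

# Working covariance for one cluster: V_i = A^{1/2} R(alpha) A^{1/2},
# where A is the diagonal matrix of marginal variances V(mu_ij).
R = ar1_corr(4, 0.6)
A_sqrt = np.diag(np.sqrt([0.25, 0.21, 0.24, 0.18]))  # e.g. mu(1 - mu) values
V = A_sqrt @ R @ A_sqrt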

Estimation procedures

Iterative solution methods

The estimating equations for the regression parameters in generalized estimating equations (GEE) are typically solved using a modified Fisher scoring algorithm, which iteratively updates the parameter vector $\beta$. The update rule is given by

$$\beta^{(k+1)} = \beta^{(k)} + \left[ \sum_{i=1}^N D_i^T V_i^{-1} D_i \right]^{-1} \sum_{i=1}^N D_i^T V_i^{-1} (Y_i - \mu_i),$$

where $D_i = \partial \mu_i / \partial \beta^T$ is the derivative matrix for cluster $i$, $V_i$ is the working covariance matrix, and $\mu_i$ is the mean vector. This approach leverages the expected information matrix and extends the Fisher scoring method from generalized linear models. Under suitable initialization, and when the solution is close to the true parameters, the algorithm exhibits quadratic convergence.

Initialization commonly starts with a fit assuming an independence working structure (i.e., $\alpha = 0$), yielding initial $\beta$ estimates, after which the correlation parameters $\alpha$ and dispersion $\phi$ are updated iteratively using moment-based estimators. Convergence is monitored via criteria such as the change in $\beta$ between iterations falling below a threshold (e.g., $10^{-6}$) or the norm of the score equations $\sum_i D_i^T V_i^{-1} (Y_i - \mu_i)$ being sufficiently small; if non-convergence occurs, modified weighting schemes can be applied to stabilize the iterations. For unstructured working correlation matrices, the computational cost is $O\left( \sum_{i=1}^N n_i^3 \right)$ because an $n_i \times n_i$ matrix must be inverted for each cluster, though this is often mitigated in large datasets by adopting exchangeable, autoregressive, or other low-rank approximations that reduce inversion costs to $O(n_i)$.
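Under the (illustrative) assumption that helper callables returning $\mu_i$, $D_i$, and $V_i$ are available, the update rule above translates into a short loop. This is a sketch of the Fisher scoring iteration only; a full implementation would also re-estimate $\alpha$ and $\phi$ between steps, as described above.

import numpy as np

def gee_fisher_scoring(beta0, clusters, mu_fn, D_fn, V_fn,
                       tol=1e-6, max_iter=50):
    # clusters : list of (X_i, Y_i) per-cluster data
    # mu_fn, D_fn, V_fn : callables giving mu_i, D_i, V_i at the current beta
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        p = beta.size
        H = np.zeros((p, p))  # sum_i D_i' V_i^{-1} D_i
        score = np.zeros(p)   # sum_i D_i' V_i^{-1} (Y_i - mu_i)
        for X, Y in clusters:
            mu, D, V = mu_fn(X, beta), D_fn(X, beta), V_fn(X, beta)
            Vinv_D = np.linalg.solve(V, D)
            H += D.T @ Vinv_D
            score += Vinv_D.T @ (Y - mu)
        step = np.linalg.solve(H, score)  # Fisher scoring update direction
        beta += step
        if np.max(np.abs(step)) < tol:    # convergence on the change in beta
            return beta
    return beta  # may not have converged; a real solver would warn here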

Variance estimation

In generalized estimating equations (GEE), the model-based variance estimator for the parameter estimates $\hat{\beta}$ assumes the working covariance matrix $V_i$ is correctly specified and is approximated by the inverse of the weighted sum of the derivatives:

$$\widehat{\operatorname{Var}}(\hat{\beta}) \approx \left( \sum_{i=1}^n D_i^T V_i^{-1} D_i \right)^{-1}.$$
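In statsmodels, both estimators are exposed at fit time; the sketch below continues the hypothetical example from the Formulation section and assumes the package's cov_type options "robust" and "naive" for the sandwich and model-based estimators respectively.

# Robust (sandwich) standard errors are the default; "naive" requests
# the model-based estimator, which trusts the working covariance V_i.
res_robust = model.fit()
res_naive = model.fit(cov_type="naive")
print(res_robust.bse)  # valid even if the working correlation is wrong
print(res_naive.bse)   # valid only if V_i is correctly specified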