Generalized linear model
from Wikipedia

In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

Generalized linear models were formulated by John Nelder and Robert Wedderburn as a way of unifying various other statistical models, including linear regression, logistic regression and Poisson regression.[1] They proposed an iteratively reweighted least squares method for maximum likelihood estimation (MLE) of the model parameters. MLE remains popular and is the default method on many statistical computing packages. Other approaches, including Bayesian regression and least squares fitting to variance stabilized responses, have been developed.

Intuition


Ordinary linear regression predicts the expected value of a given unknown quantity (the response variable, a random variable) as a linear combination of a set of observed values (predictors). This implies that a constant change in a predictor leads to a constant change in the response variable (i.e. a linear-response model). This is appropriate when the response variable can vary, to a good approximation, indefinitely in either direction, or more generally for any quantity that only varies by a relatively small amount compared to the variation in the predictive variables, e.g. human heights.

However, these assumptions are inappropriate for some types of response variables. For example, in cases where the response variable is expected to be always positive and varying over a wide range, constant input changes lead to geometrically (i.e. exponentially) varying, rather than constantly varying, output changes. As an example, suppose a linear prediction model learns from some data (perhaps primarily drawn from large beaches) that a 10-degree temperature decrease would lead to 1,000 fewer people visiting the beach. This model is unlikely to generalize well over differently-sized beaches. More specifically, the problem is that if the model is used to predict the new attendance with a temperature drop of 10 degrees for a beach that regularly receives 50 beachgoers, it would predict an impossible attendance value of −950. Logically, a more realistic model would instead predict a constant rate of increased beach attendance (e.g. an increase of 10 degrees leads to a doubling in beach attendance, and a drop of 10 degrees leads to a halving in attendance). Such a model is termed an exponential-response model (or log-linear model, since the logarithm of the response is predicted to vary linearly).

Similarly, a model that predicts a probability of making a yes/no choice (a Bernoulli variable) is even less suitable as a linear-response model, since probabilities are bounded on both ends (they must be between 0 and 1). Imagine, for example, a model that predicts the likelihood of a given person going to the beach as a function of temperature. A reasonable model might predict, for example, that a change in 10 degrees makes a person two times more or less likely to go to the beach. But what does "twice as likely" mean in terms of a probability? It cannot literally mean to double the probability value (e.g. 50% becomes 100%, 75% becomes 150%, etc.). Rather, it is the odds that are doubling: from 2:1 odds, to 4:1 odds, to 8:1 odds, etc. Such a model is a log-odds or logistic model.

Generalized linear models cover all these situations by allowing for response variables that have arbitrary distributions (rather than simply normal distributions), and for an arbitrary function of the response variable (the link function) to vary linearly with the predictors (rather than assuming that the response itself must vary linearly). For example, the case above of predicted number of beach attendees would typically be modeled with a Poisson distribution and a log link, while the case of predicted probability of beach attendance would typically be modeled with a Bernoulli distribution (or binomial distribution, depending on exactly how the problem is phrased) and a log-odds (or logit) link function.
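As a rough numerical sketch of the beach example above (with made-up numbers), a log-link model multiplies expected attendance by a fixed factor per temperature change, so predictions stay positive even for small beaches:

```python
import numpy as np

# Hypothetical numbers for the beach example: with a log link, a fixed
# temperature change multiplies expected attendance instead of shifting it
# by a fixed amount, so small beaches can never be driven below zero.
per_degree = np.log(2.0) / 10.0      # assumed: +10 degrees doubles attendance
for name, baseline in [("large beach", 2000.0), ("small beach", 50.0)]:
    eta = np.log(baseline) + per_degree * (-10.0)   # linear predictor after a 10-degree drop
    print(f"{name}: {baseline:.0f} -> {np.exp(eta):.0f} expected visitors")
# A linear-response model with a slope of 100 visitors per degree would instead
# predict 50 - 1000 = -950 visitors for the small beach.
```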

Overview


In a generalized linear model (GLM), each outcome Y of the dependent variables is assumed to be generated from a particular distribution in an exponential family, a large class of probability distributions that includes the normal, binomial, Poisson and gamma distributions, among others. The conditional mean μ of the distribution depends on the independent variables X through:

$$ \operatorname{E}(Y \mid X) = \mu = g^{-1}(X\beta), $$

where E(Y | X) is the expected value of Y conditional on X; Xβ is the linear predictor, a linear combination of unknown parameters β; g is the link function.

In this framework, the variance is typically a function, V, of the mean:

$$ \operatorname{Var}(Y \mid X) = V(\mu) = V\!\big(g^{-1}(X\beta)\big). $$

It is convenient if V follows from an exponential family of distributions, but it may simply be that the variance is a function of the predicted value.

The unknown parameters, β, are typically estimated with maximum likelihood, maximum quasi-likelihood, or Bayesian techniques.
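For illustration, here is a hedged sketch of fitting such a model in Python with the statsmodels package; the data are simulated and the "true" coefficients 0.5 and 0.8 are arbitrary choices:

```python
import numpy as np
import statsmodels.api as sm

# Sketch: fit a Poisson GLM with the canonical log link on simulated data.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
X = sm.add_constant(x)                      # design matrix with intercept
mu = np.exp(0.5 + 0.8 * x)                  # assumed true mean on the response scale
y = rng.poisson(mu)

model = sm.GLM(y, X, family=sm.families.Poisson())   # log link is the default
result = model.fit()                                  # maximum likelihood via IRLS
print(result.params)        # estimates of beta, close to (0.5, 0.8)
print(result.deviance)      # residual deviance for goodness of fit
```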

Model components


The GLM consists of three elements:

1. A particular distribution for modeling $Y$ from among those which are considered exponential families of probability distributions,
2. A linear predictor $\eta = X\beta$, and
3. A link function $g$ such that $\operatorname{E}(Y \mid X) = \mu = g^{-1}(\eta)$.

Probability distribution


An overdispersed exponential family of distributions is a generalization of an exponential family and the exponential dispersion model of distributions and includes those families of probability distributions, parameterized by $\boldsymbol\theta$ and $\tau$, whose density functions $f$ (or probability mass function, for the case of a discrete distribution) can be expressed in the form

$$ f_Y(\mathbf{y} \mid \boldsymbol\theta, \tau) = h(\mathbf{y}, \tau) \exp\!\left( \frac{\mathbf{b}(\boldsymbol\theta)^{\mathsf T} \mathbf{T}(\mathbf{y}) - A(\boldsymbol\theta)}{d(\tau)} \right). $$

The dispersion parameter, $\tau$, typically is known and is usually related to the variance of the distribution. The functions $h(\mathbf{y}, \tau)$, $\mathbf{b}(\boldsymbol\theta)$, $\mathbf{T}(\mathbf{y})$, $A(\boldsymbol\theta)$, and $d(\tau)$ are known. Many common distributions are in this family, including the normal, exponential, gamma, Poisson, Bernoulli, and (for fixed number of trials) binomial, multinomial, and negative binomial.

For scalar $\mathbf{y}$ and $\boldsymbol\theta$ (denoted $y$ and $\theta$ in this case), this reduces to

$$ f_Y(y \mid \theta, \tau) = h(y, \tau) \exp\!\left( \frac{b(\theta) T(y) - A(\theta)}{d(\tau)} \right). $$

$\boldsymbol\theta$ is related to the mean of the distribution. If $\mathbf{b}(\boldsymbol\theta)$ is the identity function, then the distribution is said to be in canonical form (or natural form). Note that any distribution can be converted to canonical form by rewriting $\boldsymbol\theta$ as $\boldsymbol\theta'$ and then applying the transformation $\boldsymbol\theta = \mathbf{b}(\boldsymbol\theta')$. It is always possible to convert $A(\boldsymbol\theta)$ in terms of the new parametrization, even if $\mathbf{b}(\boldsymbol\theta')$ is not a one-to-one function; see comments in the page on exponential families.

If, in addition, $\mathbf{T}(\mathbf{y})$ and $d(\tau)$ are the identity, then $\boldsymbol\theta$ is called the canonical parameter (or natural parameter) and is related to the mean through

$$ \boldsymbol\mu = \operatorname{E}(\mathbf{Y}) = \nabla A(\boldsymbol\theta). $$

For scalar $y$ and $\theta$, this reduces to

$$ \mu = \operatorname{E}(Y) = A'(\theta). $$

Under this scenario, the variance of the distribution can be shown to be[2]

$$ \operatorname{Var}(\mathbf{Y}) = \nabla \nabla^{\mathsf T} A(\boldsymbol\theta)\, d(\tau). $$

For scalar $y$ and $\theta$, this reduces to

$$ \operatorname{Var}(Y) = A''(\theta)\, d(\tau). $$
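A small sketch checking these two identities for the Poisson distribution, whose canonical-form cumulant is $A(\theta) = \exp(\theta)$ with $d(\tau) = 1$ (a Monte Carlo check on simulated draws, not part of the article):

```python
import numpy as np

# Check E[Y] = A'(theta) and Var(Y) = A''(theta) d(tau) for the Poisson
# distribution in canonical form, where A(theta) = exp(theta), d(tau) = 1,
# and theta = log(lambda).
rng = np.random.default_rng(1)
lam = 3.7
theta = np.log(lam)
samples = rng.poisson(lam, size=1_000_000)

print("A'(theta)  =", np.exp(theta), " sample mean     =", samples.mean())
print("A''(theta) =", np.exp(theta), " sample variance =", samples.var())
```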

Linear predictor


The linear predictor is the quantity which incorporates the information about the independent variables into the model. The symbol η (Greek "eta") denotes a linear predictor. It is related to the expected value of the data through the link function.

η is expressed as linear combinations (thus, "linear") of unknown parameters β. The coefficients of the linear combination are represented as the matrix of independent variables X. η can thus be expressed as

$$ \eta = X\beta. $$

Link function

The link function provides the relationship between the linear predictor and the mean of the distribution function. There are many commonly used link functions, and their choice is informed by several considerations. There is always a well-defined canonical link function which is derived from the exponential of the response's density function. However, in some cases it makes sense to try to match the domain of the link function to the range of the distribution function's mean, or use a non-canonical link function for algorithmic purposes, for example Bayesian probit regression.

When using a distribution function with a canonical parameter $\theta$, the canonical link function is the function that expresses $\theta$ in terms of $\mu$, i.e. $\theta = b(\mu)$. For the most common distributions, the mean $\mu$ is one of the parameters in the standard form of the distribution's density function, and then $b(\mu)$ is the function as defined above that maps the density function into its canonical form. When using the canonical link function, $\theta = b(\mu) = X\beta$, which allows $X^{\mathsf T} Y$ to be a sufficient statistic for $\beta$.

Following is a table of several exponential-family distributions in common use and the data they are typically used for, along with the canonical link functions and their inverses (sometimes referred to as the mean function, as done here).

Common distributions with typical uses and canonical link functions

| Distribution | Support of distribution | Typical uses | Link name | Link function, $X\beta = g(\mu)$ | Mean function |
|---|---|---|---|---|---|
| Normal, Laplace | real: $(-\infty, +\infty)$ | Linear-response data | Identity | $X\beta = \mu$ | $\mu = X\beta$ |
| Exponential, Gamma | real: $(0, +\infty)$ | Exponential-response data, scale parameters | Negative inverse | $X\beta = -\mu^{-1}$ | $\mu = -(X\beta)^{-1}$ |
| Inverse Gaussian | real: $(0, +\infty)$ | | Inverse squared | $X\beta = \mu^{-2}$ | $\mu = (X\beta)^{-1/2}$ |
| Poisson | integer: $0, 1, 2, \dots$ | count of occurrences in fixed amount of time/space | Log | $X\beta = \ln(\mu)$ | $\mu = \exp(X\beta)$ |
| Bernoulli | integer: $\{0, 1\}$ | outcome of single yes/no occurrence | Logit | $X\beta = \ln\!\left(\frac{\mu}{1-\mu}\right)$ | $\mu = \frac{\exp(X\beta)}{1 + \exp(X\beta)}$ |
| Binomial | integer: $0, 1, \dots, N$ | count of # of "yes" occurrences out of N yes/no occurrences | Logit | $X\beta = \ln\!\left(\frac{\mu}{n-\mu}\right)$ | $\mu = \frac{n}{1 + \exp(-X\beta)}$ |
| Categorical | integer: $[0, K)$, or K-vector of integer: $[0, 1]$, where exactly one element in the vector has the value 1 | outcome of single K-way occurrence | Logit | | |
| Multinomial | K-vector of integer: $[0, N]$ | count of occurrences of different types (1, ..., K) out of N total K-way occurrences | Logit | | |
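The canonical links and mean functions in the table can be written down directly; the following sketch collects a few of them and verifies that each mean function inverts its link (the function names are illustrative, not from any particular library):

```python
import numpy as np

# A few canonical link functions g and their inverses (mean functions) from
# the table above, checked for consistency on a grid of mean values.
links = {
    "identity (normal)":    (lambda mu: mu,                    lambda eta: eta),
    "log (Poisson)":        (np.log,                           np.exp),
    "logit (Bernoulli)":    (lambda mu: np.log(mu / (1 - mu)), lambda eta: 1 / (1 + np.exp(-eta))),
    "neg. inverse (gamma)": (lambda mu: -1 / mu,               lambda eta: -1 / eta),
}

mu = np.array([0.1, 0.3, 0.5, 0.7, 0.9])   # valid means for every family listed here
for name, (g, g_inv) in links.items():
    assert np.allclose(g_inv(g(mu)), mu), name
    print(f"{name}: round-trip OK")
```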

In the cases of the exponential and gamma distributions, the domain of the canonical link function is not the same as the permitted range of the mean. In particular, the linear predictor may be positive, which would give an impossible negative mean. When maximizing the likelihood, precautions must be taken to avoid this. An alternative is to use a noncanonical link function.

In the case of the Bernoulli, binomial, categorical and multinomial distributions, the support of the distributions is not the same type of data as the parameter being predicted. In all of these cases, the predicted parameter is one or more probabilities, i.e. real numbers in the range [0, 1]. The resulting model is known as logistic regression (or multinomial logistic regression in the case that K-way rather than binary values are being predicted).

For the Bernoulli and binomial distributions, the parameter is a single probability, indicating the likelihood of occurrence of a single event. The Bernoulli still satisfies the basic condition of the generalized linear model in that, even though a single outcome will always be either 0 or 1, the expected value will nonetheless be a real-valued probability, i.e. the probability of occurrence of a "yes" (or 1) outcome. Similarly, in a binomial distribution, the expected value is Np, i.e. the expected proportion of "yes" outcomes will be the probability to be predicted.

For categorical and multinomial distributions, the parameter to be predicted is a K-vector of probabilities, with the further restriction that all probabilities must add up to 1. Each probability indicates the likelihood of occurrence of one of the K possible values. For the multinomial distribution, and for the vector form of the categorical distribution, the expected values of the elements of the vector can be related to the predicted probabilities similarly to the binomial and Bernoulli distributions.

Fitting


Maximum likelihood


The maximum likelihood estimates can be found using an iteratively reweighted least squares algorithm or a Newton's method with updates of the form:

$$ \boldsymbol\beta^{(t+1)} = \boldsymbol\beta^{(t)} + \mathcal{J}^{-1}\!\big(\boldsymbol\beta^{(t)}\big)\, u\!\big(\boldsymbol\beta^{(t)}\big), $$

where $\mathcal{J}(\boldsymbol\beta^{(t)})$ is the observed information matrix (the negative of the Hessian matrix) and $u(\boldsymbol\beta^{(t)})$ is the score function; or a Fisher's scoring method:

$$ \boldsymbol\beta^{(t+1)} = \boldsymbol\beta^{(t)} + \mathcal{I}^{-1}\!\big(\boldsymbol\beta^{(t)}\big)\, u\!\big(\boldsymbol\beta^{(t)}\big), $$

where $\mathcal{I}(\boldsymbol\beta^{(t)})$ is the Fisher information matrix. Note that if the canonical link function is used, then they are the same.[3]
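A minimal sketch of Fisher scoring / IRLS for a logistic-regression GLM, where Newton's method and Fisher scoring coincide because the logit link is canonical (simulated data, illustrative function name):

```python
import numpy as np

# Minimal IRLS / Fisher scoring for logistic regression (canonical logit link).
def irls_logistic(X, y, n_iter=25, tol=1e-10):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))      # inverse logit
        W = mu * (1.0 - mu)                   # variance function = IRLS weights
        score = X.T @ (y - mu)                # score function u(beta)
        info = X.T @ (X * W[:, None])         # Fisher information X' W X
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated example
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(400), rng.normal(size=400)])
p = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.3, 1.2]))))
y = rng.binomial(1, p)
print(irls_logistic(X, y))   # close to (-0.3, 1.2)
```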

Bayesian methods


In general, the posterior distribution cannot be found in closed form and so must be approximated, usually using Laplace approximations or some type of Markov chain Monte Carlo method such as Gibbs sampling.
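As one hedged illustration of MCMC for a GLM posterior, the sketch below uses a simple random-walk Metropolis sampler (rather than the Gibbs sampling mentioned above) for a logistic model with assumed Normal(0, 10²) priors; the priors, step size, and sample counts are arbitrary choices:

```python
import numpy as np

# Random-walk Metropolis sampling of the posterior of a logistic GLM with
# independent Normal(0, 10^2) priors on the coefficients (illustrative choices).
def log_posterior(beta, X, y, prior_sd=10.0):
    eta = X @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))   # Bernoulli log-likelihood
    logprior = -0.5 * np.sum((beta / prior_sd) ** 2)
    return loglik + logprior

def metropolis(X, y, n_samples=5000, step=0.1, seed=3):
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    current = log_posterior(beta, X, y)
    draws = []
    for _ in range(n_samples):
        proposal = beta + step * rng.normal(size=beta.size)
        cand = log_posterior(proposal, X, y)
        if np.log(rng.uniform()) < cand - current:      # accept/reject step
            beta, current = proposal, cand
        draws.append(beta.copy())
    return np.array(draws)

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(300), rng.normal(size=300)])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([0.5, -1.0])))))
draws = metropolis(X, y)
print(draws[1000:].mean(axis=0))   # posterior means, near (0.5, -1.0)
```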

Examples


General linear models


A possible point of confusion has to do with the distinction between generalized linear models and general linear models, two broad statistical models. Co-originator John Nelder has expressed regret over this terminology.[4]

The general linear model may be viewed as a special case of the generalized linear model with identity link and responses normally distributed. As most exact results of interest are obtained only for the general linear model, the general linear model has undergone a somewhat longer historical development. Results for the generalized linear model with non-identity link are asymptotic (tending to work well with large samples).

Linear regression


A simple, very important example of a generalized linear model (also an example of a general linear model) is linear regression. In linear regression, the use of the least-squares estimator is justified by the Gauss–Markov theorem, which does not assume that the distribution is normal.

From the perspective of generalized linear models, however, it is useful to suppose that the distribution function is the normal distribution with constant variance and the link function is the identity, which is the canonical link if the variance is known. Under these assumptions, the least-squares estimator is obtained as the maximum-likelihood parameter estimate.

For the normal distribution, the generalized linear model has a closed form expression for the maximum-likelihood estimates, which is convenient. Most other GLMs lack closed form estimates.
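A short sketch of this closed-form property: for the normal/identity case the maximum-likelihood estimate is just the least-squares solution, which can be computed directly (simulated data):

```python
import numpy as np

# For the normal distribution with identity link, the GLM maximum likelihood
# estimate coincides with the closed-form least-squares solution.
rng = np.random.default_rng(5)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=200)

beta_normal_eqs = np.linalg.solve(X.T @ X, X.T @ y)    # (X'X)^{-1} X'y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)     # numerically stable equivalent
print(beta_normal_eqs, beta_lstsq)                     # identical up to rounding
```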

Binary data


When the response data, Y, are binary (taking on only values 0 and 1), the distribution function is generally chosen to be the Bernoulli distribution and the interpretation of μi is then the probability, p, of Yi taking on the value one.

There are several popular link functions for binomial data.

Logit link function

The most typical link function is the canonical logit link:

$$ g(p) = \ln\!\left(\frac{p}{1 - p}\right). $$

GLMs with this setup are logistic regression models (or logit models).

Probit link function

Alternatively, the inverse of any continuous cumulative distribution function (CDF) can be used for the link since the CDF's range is [0, 1], the range of the binomial mean. The normal CDF $\Phi$ is a popular choice and yields the probit model. Its link is

$$ g(p) = \Phi^{-1}(p). $$
The reason for the use of the probit model is that a constant scaling of the input variable to a normal CDF (which can be absorbed through equivalent scaling of all of the parameters) yields a function that is practically identical to the logit function, but probit models are more tractable in some situations than logit models. (In a Bayesian setting in which normally distributed prior distributions are placed on the parameters, the relationship between the normal priors and the normal CDF link function means that a probit model can be computed using Gibbs sampling, while a logit model generally cannot.)
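A quick numeric check of this near-equivalence, using the conventional scaling constant of about 1.702 (an assumption of this sketch, not a value from the article):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit   # logistic (inverse logit) function

# A rescaled normal CDF is nearly indistinguishable from the logistic function.
# The constant 1.702 keeps the two curves within about 0.01 of each other.
eta = np.linspace(-6, 6, 1201)
max_gap = np.max(np.abs(expit(eta) - norm.cdf(eta / 1.702)))
print(f"max |logistic - scaled probit| = {max_gap:.4f}")   # roughly 0.009
```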

Complementary log-log (cloglog)


The complementary log-log function may also be used:

$$ g(p) = \log\!\big(-\log(1 - p)\big). $$

This link function is asymmetric and will often produce different results from the logit and probit link functions.[5] The cloglog model corresponds to applications where we observe either zero events (e.g., defects) or one or more, where the number of events is assumed to follow the Poisson distribution.[6] The Poisson assumption means that

$$ \Pr(0) = \exp(-\mu), $$

where μ is a positive number denoting the expected number of events. If p represents the proportion of observations with at least one event, its complement

$$ 1 - p = \Pr(0) = \exp(-\mu), $$

and then

$$ -\log(1 - p) = \mu. $$

A linear model requires the response variable to take values over the entire real line. Since μ must be positive, we can enforce that by taking the logarithm, and letting log(μ) be a linear model. This produces the "cloglog" transformation

$$ \log\!\big(-\log(1 - p)\big) = \log(\mu). $$
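A small numeric sketch confirming the derivation: if p = 1 − exp(−μ), then the complementary log-log of p equals log(μ):

```python
import numpy as np

# Numeric check of the cloglog derivation above.
mu = np.array([0.1, 0.5, 1.0, 2.5, 7.0])
p = 1.0 - np.exp(-mu)                  # probability of one or more Poisson(mu) events
cloglog = np.log(-np.log(1.0 - p))     # complementary log-log transform
print(np.allclose(cloglog, np.log(mu)))   # True
```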

Identity link

The identity link g(p) = p is also sometimes used for binomial data to yield a linear probability model. However, the identity link can predict nonsense "probabilities" less than zero or greater than one. This can be avoided by using a transformation like cloglog, probit or logit (or any inverse cumulative distribution function). A primary merit of the identity link is that it can be estimated using linear math—and other standard link functions are approximately linear matching the identity link near p = 0.5.

Variance function


The variance function for "quasibinomial" data is:

$$ \operatorname{Var}(Y_i) = \tau \mu_i (1 - \mu_i), $$

where the dispersion parameter τ is exactly 1 for the binomial distribution. Indeed, the standard binomial likelihood omits τ. When it is present, the model is called "quasibinomial", and the modified likelihood is called a quasi-likelihood, since it is not generally the likelihood corresponding to any real family of probability distributions. If τ exceeds 1, the model is said to exhibit overdispersion.

Multinomial regression


The binomial case may be easily extended to allow for a multinomial distribution as the response (also, a Generalized Linear Model for counts, with a constrained total). There are two ways in which this is usually done:

Ordered response


If the response variable is ordinal, then one may fit a model function of the form:

$$ g(\mu_m) = \eta_m, \qquad \text{where } \mu_m = \operatorname{P}(Y \leq m), $$

for m > 2. Different links g lead to ordinal regression models like proportional odds models or ordered probit models.

Unordered response


If the response variable is a nominal measurement, or the data do not satisfy the assumptions of an ordered model, one may fit a model of the following form:

$$ g(\mu_m) = \eta_m, \qquad \text{where } \mu_m = \operatorname{P}(Y = m \mid Y \in \{1, m\}), $$

for m > 2. Different links g lead to multinomial logit or multinomial probit models. These are more general than the ordered response models, and more parameters are estimated.

Count data


Another example of generalized linear models includes Poisson regression which models count data using the Poisson distribution. The link is typically the logarithm, the canonical link.

The variance function is proportional to the mean:

$$ \operatorname{Var}(Y_i) = \tau \mu_i, $$

where the dispersion parameter τ is typically fixed at exactly one. When it is not, the resulting quasi-likelihood model is often described as Poisson with overdispersion or quasi-Poisson.
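A hedged sketch of detecting overdispersion in practice: fit a Poisson GLM (here with statsmodels) and estimate the dispersion as the Pearson chi-square statistic divided by the residual degrees of freedom; the counts below are simulated from a negative binomial, so the estimate should exceed 1:

```python
import numpy as np
import statsmodels.api as sm

# Estimate the dispersion parameter tau = Pearson chi-square / df_resid.
rng = np.random.default_rng(6)
x = rng.normal(size=1000)
X = sm.add_constant(x)
mu = np.exp(0.2 + 0.5 * x)
y = rng.negative_binomial(n=2, p=2 / (2 + mu))       # overdispersed counts with mean mu

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
tau_hat = fit.pearson_chi2 / fit.df_resid
print(f"estimated dispersion: {tau_hat:.2f}")        # noticeably greater than 1
```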

Extensions


Correlated or clustered data


The standard GLM assumes that the observations are uncorrelated. Extensions have been developed to allow for correlation between observations, as occurs for example in longitudinal studies and clustered designs:

  • Generalized estimating equations (GEEs) allow for the correlation between observations without the use of an explicit probability model for the origin of the correlations, so there is no explicit likelihood. They are suitable when the random effects and their variances are not of inherent interest, as they allow for the correlation without explaining its origin. The focus is on estimating the average response over the population ("population-averaged" effects) rather than the regression parameters that would enable prediction of the effect of changing one or more components of X on a given individual. GEEs are usually used in conjunction with Huber–White standard errors.[7][8]
  • Generalized linear mixed models (GLMMs) are an extension to GLMs that includes random effects in the linear predictor, giving an explicit probability model that explains the origin of the correlations. The resulting "subject-specific" parameter estimates are suitable when the focus is on estimating the effect of changing one or more components of X on a given individual. GLMMs are also referred to as multilevel models and as mixed models. In general, fitting GLMMs is more computationally complex and intensive than fitting GEEs.

Generalized additive models


Generalized additive models (GAMs) are another extension to GLMs in which the linear predictor η is not restricted to be linear in the covariates X but is the sum of smoothing functions applied to the $x_i$s:

$$ \eta = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots $$

The smoothing functions $f_i$ are estimated from the data. In general this requires a large number of data points and is computationally intensive.[9][10]

from Grokipedia
A generalized linear model (GLM) is a statistical modeling framework that extends the classical linear model to accommodate response variables whose distributions belong to the exponential family, unifying various regression techniques under a common structure. Introduced by John Nelder and Robert Wedderburn in 1972, GLMs were developed to provide a unified approach for estimating parameters and analyzing data in diverse scenarios, such as proportions, counts, and continuous positive values, where the assumptions of ordinary linear regression do not hold. This framework has three core components: a random component specifying the probability distribution of the response variable (typically from the exponential family, including the normal, binomial, Poisson, and gamma distributions); a systematic component consisting of a linear predictor formed as a linear combination of explanatory variables; and a link function that connects the expected value of the response to the linear predictor by transforming the mean to match the scale of the linear predictor. For instance, in logistic regression, a common GLM application, the link function is the logit, linking the probability of a binary outcome to predictors, while Poisson regression uses a log link for count data. GLMs are fitted using maximum likelihood estimation, often via iteratively reweighted least squares, and have become foundational across many applied fields for handling non-normal data while maintaining interpretability of coefficients as changes on the link scale.

Introduction

Intuition

The classical linear regression model serves as a foundational special case of the generalized linear model (GLM), where the response variable is assumed to follow a normal distribution and the expected value is directly equal to a linear combination of the predictors, akin to drawing a straight line through data points to predict continuous outcomes like house prices or temperatures. This approach works well when responses vary symmetrically around the mean without bounds, but real-world data often defies such assumptions, such as when outcomes are binary (yes/no) or counts (number of events), leading to skewed or bounded distributions that violate normality.

GLMs address this limitation by incorporating a link function, which acts like a flexible translator: it connects the linear combination of predictors (a weighted sum capturing the influence of each explanatory variable) to the expected value of the response in a way that respects the data's natural variability. For instance, in predicting whether a customer buys a product (binary response), the link function might transform the linear predictor into a probability between 0 and 1, avoiding impossible predictions like negative probabilities; similarly, for counts such as numbers of visits, it ensures the mean is positive and aligns with count-like fluctuations. This transformation allows GLMs to handle non-normal responses without forcing awkward data manipulations, much like adjusting a map projection to better fit an irregular surface rather than stretching a flat map.

To illustrate the core relationships, the data flow in a GLM can be sketched as:

Predictors (e.g., features like dosage, exposure) → Linear predictor (weighted sum) → Link function (transforms to fit the response scale) → Mean of response (expected value) → Distribution (e.g., binomial for binary data, Poisson for counts) → Observed response

This structure highlights how inputs systematically influence outcomes through intermediary steps tailored to the data type. A primary advantage of GLMs lies in their ability to unify diverse regression techniques, ranging from ordinary least squares for continuous data to logistic regression for binary outcomes and Poisson regression for counts, within a cohesive framework built around exponential family distributions, enabling consistent estimation and inference across applications. Exponential family distributions provide the probabilistic foundation for modeling response variability in this setup.

Historical Development

The foundations of generalized linear models (GLMs) trace back to the early twentieth century, particularly Ronald A. Fisher's work on sufficiency and maximum likelihood in his 1922 paper, which laid the theoretical groundwork for the exponential family of distributions by characterizing distributions in a form amenable to likelihood-based inference, influencing later extensions to regression contexts. Fisher's contributions in the 1920s and 1930s emphasized likelihood-based estimation and the structure of variance in these families, setting the stage for broader applications beyond the normal linear model.

The formal introduction of GLMs occurred in 1972 through the seminal paper by John A. Nelder and Robert W. M. Wedderburn, titled "Generalized Linear Models," published in the Journal of the Royal Statistical Society, Series A. In this work, they proposed a class of models that extended classical linear regression to response variables following exponential family distributions, incorporating a linear predictor and a link function to connect the mean response to covariates. This framework was initially implemented in the GENSTAT statistical software package, developed at Rothamsted Experimental Station, where Nelder served as director, enabling practical computation via iteratively reweighted least squares. Building on this, Wedderburn introduced quasi-likelihood methods in 1974, relaxing the full distributional assumptions to focus on mean-variance relationships, which further broadened the applicability of GLM-like approaches to non-exponential family data.

The 1980s marked significant expansions, including the development of generalized estimating equations (GEEs) by Kung-Yee Liang and Scott L. Zeger in 1986, which adapted GLM estimation for correlated data in longitudinal and clustered studies while maintaining robust inference. GLMs gained widespread adoption through integration into major statistical software: SAS introduced PROC GENMOD in version 6.09 (1993) for fitting GLMs and GEEs, while the R programming language incorporated the glm() function in its early releases during the 1990s, facilitating accessible implementation for diverse users. In the 2000s, GLMs influenced machine learning, notably through the GLMNET package (introduced in 2008), which added lasso and elastic-net regularization to GLM estimation for high-dimensional data and variable selection. These milestones underscore GLMs' evolution from theoretical construct to a cornerstone of modern statistical and computational practice.

Overview

A generalized linear model (GLM) provides a unified approach to regression by extending the classical linear model to handle response variables that follow distributions other than the normal, such as counts or proportions. The framework specifies that each independent response $Y_i$ (for $i = 1, \dots, n$) arises from a distribution in the exponential family, with its mean $\mu_i = E(Y_i)$ connected to a linear predictor $\eta_i = \mathbf{x}_i^{\mathsf T} \boldsymbol{\beta}$ via a monotonic link function $g$, such that $g(\mu_i) = \eta_i$. This structure allows the mean of the response to vary flexibly across observations while maintaining a linear relationship in the parameter space.

The GLM framework rests on three fundamental assumptions: the linear predictor is linear in the unknown parameters $\boldsymbol{\beta}$; the observations $Y_i$ are independent; and both the conditional distribution of the response and the link function are correctly specified. These assumptions enable the model to capture systematic variation in the data without requiring the stringent normality and constant variance conditions of ordinary least squares regression. Violations of these assumptions can lead to biased estimates or invalid inferences, underscoring the importance of diagnostic checks in practice.

To apply a GLM, one first selects a distribution from the exponential family that matches the nature of the response variable, then chooses an appropriate link function, often the canonical one associated with the distribution, to connect the mean to the predictors. Parameter estimation typically proceeds via maximum likelihood methods, followed by model assessment to evaluate goodness-of-fit and predictive performance. This facilitates the modeling of diverse response types, including Poisson-distributed counts or binomial proportions, within a consistent statistical framework.

GLMs offer key advantages over ad-hoc or separate modeling strategies for non-normal data, including the interpretability of regression coefficients on the transformed scale of the link function and a cohesive likelihood-based approach to inference, such as hypothesis testing and confidence intervals. This unification simplifies software implementation and theoretical development, making GLMs a cornerstone of modern statistical modeling.

Core Components

Probability Distribution

In generalized linear models (GLMs), the response variable $Y$ is assumed to follow a distribution from the exponential family, which provides a unified framework for modeling various types of data, including continuous, discrete, and count data. This family encompasses a wide range of distributions commonly encountered in statistical applications, allowing the mean and variance of $Y$ to be linked in a manner that facilitates estimation and inference. The general form of a density or mass function for a random variable $Y$ in the exponential family, as used in GLMs, is given by

$$ f(y; \theta, \phi) = \exp\!\left[ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right], $$

where $\theta$ is the natural (or canonical) parameter controlling the location, $\phi$ is the dispersion parameter influencing the scale, $b(\cdot)$ and $c(\cdot, \cdot)$ are known functions specific to the distribution, and $a(\cdot)$ is typically $\phi / w$ with $w$ a known prior weight (often 1). This parameterization ensures that the log-likelihood contributions are linear in the sufficient statistic $y$, simplifying computations in maximum likelihood estimation.

A key property of this form is the relationship between the mean and variance of $Y$. The expected value is $\mathbb{E}(Y) = b'(\theta) = \mu$, the derivative of $b$ with respect to $\theta$, while the variance is $\mathrm{Var}(Y) = b''(\theta)\, a(\phi) = V(\mu)\, a(\phi)$, where $V(\mu) = b''(\theta)$ is the variance function depending only on the mean. This structure parameterizes how variance relates to the mean, which is crucial for handling heteroscedasticity in non-normal data. Several common distributions belong to this family and are frequently used in GLMs, each with distinct mean-variance relationships suited to specific data types. The following table summarizes key examples:
| Distribution | Natural parameter $\theta$ | Cumulant function $b(\theta)$ | Mean $\mu = b'(\theta)$ | Variance function $V(\mu)$ | Dispersion $\phi$ interpretation |
|---|---|---|---|---|---|
| Normal | $\mu$ | $\theta^2 / 2$ | $\theta$ | 1 (constant) | $\sigma^2$ (scale) |
| Poisson | $\log \lambda$ | $\exp(\theta)$ | $\exp(\theta)$ | $\mu$ | 1 (no dispersion) |
| Binomial ($n, p$) | $\log(p/(1-p))$ | $n \log(1 + \exp(\theta))$ | $n / (1 + \exp(-\theta))$ | $\mu (1 - \mu/n)$ | 1 (no dispersion) |
| Gamma | $-1/\mu$ | $-\log(-\theta)$ | $-1/\theta$ | $\mu^2$ | Shape parameter (scale) |
For the Normal distribution, the constant variance is ideal for symmetric continuous data with homoscedasticity. The Poisson distribution, with variance equal to the mean, suits non-negative count data, where overdispersion may occur if this assumption is violated. The binomial distribution models binary or proportion data, with variance $\mu(1 - \mu/n)$ reflecting the bounded nature of success probabilities. The gamma distribution, featuring quadratic variance in the mean, is appropriate for positive continuous data such as waiting times.

The exponential family is central to GLMs because it admits sufficient statistics for $\theta$, reducing data dimensionality and enabling efficient iterative estimation algorithms such as iteratively reweighted least squares. It also naturally defines canonical link functions where the linear predictor directly equals $\theta$, ensuring desirable statistical properties such as the uniqueness of maximum likelihood estimates under regularity conditions. For cases where the full distribution is unspecified but the mean-variance relationship is known, quasi-likelihood methods extend the framework by using the same estimating equations without assuming the exponential form, broadening applicability to overdispersed or misspecified models.

Linear Predictor

In a generalized linear model (GLM), the linear predictor represents the systematic component that captures the effects of explanatory variables on the response, serving as the input to the link function. For the $i$-th observation, it is expressed as

$$ \eta_i = \beta_0 + \sum_{j=1}^p \beta_j x_{ij} = \mathbf{X}_i \boldsymbol{\beta}, $$

where $\mathbf{X}_i = (1, x_{i1}, \dots, x_{ip})$ is the vector of covariates (including an intercept term), and $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_p)^{\mathsf T}$ is the vector of unknown coefficients. This formulation assumes a linear relationship in the parameter space on the scale defined by the link function.

The coefficients $\beta_j$ in the linear predictor have a straightforward interpretation: each $\beta_j$ quantifies the change in $\eta_i$ associated with a one-unit increase in the corresponding covariate $x_{ij}$, holding all other covariates constant. This partial effect highlights the model's additive structure, where the total predicted value $\eta_i$ is the sum of individual contributions from each predictor. However, this interpretation applies directly on the link scale, and the link function subsequently connects $\eta_i$ to the expected response $\mu_i$.

A key assumption underlying the linear predictor is linearity on the link scale, meaning the predictors enter the model additively through their coefficients without inherent nonlinearity in the parameter specification. Multicollinearity among covariates can undermine the stability of this structure by inflating the variance of coefficient estimates, leading to unreliable inferences about individual $\beta_j$, though the overall model fit may remain unaffected. To enhance interpretability, particularly for the intercept $\beta_0$, centering covariates (subtracting their means from the original values) is often recommended, as it makes $\beta_0$ represent the predicted $\eta$ at average covariate levels.

Within GLMs, the linear predictor can be extended to accommodate more complex relationships by incorporating interaction terms (e.g., $\beta_{jk} x_{ij} x_{ik}$) or polynomial terms (e.g., $\beta_{j2} x_{ij}^2$) as additional covariates in $\mathbf{X}_i$, thereby allowing for non-additive or nonlinear effects on the link scale without altering the fundamental linear structure in $\boldsymbol{\beta}$. These extensions maintain the model's parsimony while improving fit for data exhibiting such patterns.

Link Function

In generalized linear models (GLMs), the link function $g$ provides a monotonic, differentiable transformation that connects the expected value of the response variable $\mu = E(Y)$ to the linear predictor $\eta = X\beta$, defined as $\eta = g(\mu)$, or equivalently $\mu = g^{-1}(\eta)$. This transformation ensures that $\mu$ remains within the valid support of the response distribution, such as $(0, 1)$ for proportions in binomial models or $(0, \infty)$ for positive continuous data in gamma models. The link function thus bridges the linear structure of the predictors to the potentially nonlinear mean-response relationship dictated by the chosen distribution.

The canonical link function holds a special status in GLMs, as it corresponds to the natural parameter of the distribution, where $\eta = \theta$. Formally, for distributions in the exponential family with density $f(y; \theta, \phi) = \exp\!\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}$, the canonical link is $g(\mu) = b'^{-1}(\mu)$, the inverse of the derivative of the cumulant function $b(\theta)$.
This choice simplifies estimation and interpretation, as it aligns the linear predictor directly with the natural parameter that governs the distribution's mean-variance relationship. Common examples include the identity link $g(\mu) = \mu$ for the normal distribution, the log link $g(\mu) = \log(\mu)$ for the Poisson distribution, and the logit link $g(\mu) = \log\!\left(\frac{\mu}{1 - \mu}\right)$ for the binomial distribution. These canonical links were central to the foundational formulation of GLMs.

Non-canonical link functions deviate from this natural parameterization, such as using the identity link with a non-normal response or the probit link $g(\mu) = \Phi^{-1}(\mu)$ for binomial data, where $\Phi$ denotes the cumulative distribution function of the standard normal distribution. While non-canonical links can offer advantages like improved model fit in specific datasets by better capturing the mean structure, they introduce challenges, including more complex computational requirements for estimation and reduced interpretability of coefficients, as the linear predictor no longer directly corresponds to the natural parameter. For instance, the probit link may provide a closer correspondence to underlying latent normal processes in binary data but complicates standard deviance-based diagnostics compared to the logit. The justification for adopting non-canonical links depends on whether the gains in fit outweigh the added analytical effort.

Selection of the link function involves a balance of theoretical, empirical, and practical considerations. Theoretically, certain links achieve variance stabilization, transforming the response to approximate constant variance across levels of the mean, which can enhance the validity of inferences; for example, the square-root link stabilizes variance in Poisson models. Empirically, information criteria such as the Akaike information criterion (AIC) or Bayesian information criterion (BIC) are used to compare models with different links by penalizing complexity while rewarding goodness-of-fit, often favoring the link that minimizes these values. Domain knowledge also plays a key role, guiding choices based on the substantive interpretation of the mean-response relationship, such as preferring the logit for odds ratios in binary outcomes.
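As a sketch of the information-criterion comparison described above, one can fit the same binomial response with a canonical logit link and a non-canonical probit link and compare AIC values (simulated data; statsmodels class names assumed):

```python
import numpy as np
import statsmodels.api as sm

# Compare a canonical (logit) and a non-canonical (probit) link by AIC.
rng = np.random.default_rng(7)
x = rng.normal(size=2000)
X = sm.add_constant(x)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.4 + 1.1 * x))))   # logistic data-generating process

for name, link in [("logit", sm.families.links.Logit()),
                   ("probit", sm.families.links.Probit())]:
    fit = sm.GLM(y, X, family=sm.families.Binomial(link=link)).fit()
    print(f"{name}: AIC = {fit.aic:.1f}")
```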

Estimation Methods

Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is the primary frequentist method for fitting generalized linear models (GLMs), providing estimates of the regression coefficients $\beta$ by maximizing the likelihood derived from the assumed exponential family distribution. The likelihood for the parameters $\beta$ is expressed as

$$ L(\beta) = \prod_{i=1}^n f\big(y_i;\, \theta(\mu(X_i \beta)),\, \phi\big), $$

where $f$ denotes the probability density or mass function from the exponential family, $\theta$ is the natural parameter linked to the mean $\mu$, $X_i$ is the $i$-th row of the design matrix, and $\phi$ is the dispersion parameter. The corresponding log-likelihood is

$$ \ell(\beta) = \sum_{i=1}^n \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi), $$

with $b$ and $a$ as cumulant and dispersion functions specific to the distribution, and $c$ independent of $\beta$. Maximizing $\ell(\beta)$ directly is often infeasible due to the nonlinearity introduced by the link function, so numerical optimization is required.

The standard algorithm for MLE in GLMs is iteratively reweighted least squares (IRLS), which iteratively solves weighted least-squares problems to approximate the score equations. Starting with initial estimates of $\mu_i$ (often from ordinary least squares), IRLS updates a working response $z_i = \eta_i + (y_i - \mu_i) \frac{d\eta_i}{d\mu_i}$, where $\eta_i = g(\mu_i)$ is the linear predictor, and then fits $z = X\beta$ by weighted least squares with weights $w_i = \frac{(d\mu_i / d\eta_i)^2}{V(\mu_i)}$, where $V$ is the variance function; the process repeats until convergence. Convergence is assessed by monitoring the relative change in $\beta$ or the log-likelihood, typically stopping when the maximum change is less than a tolerance like $10^{-8}$ or after a fixed number of iterations to prevent non-convergence in edge cases.

Once fitted, inference on $\beta$ relies on asymptotic normality of the MLE, $\hat{\beta} \sim N(\beta, I(\beta)^{-1})$, where the Fisher information matrix is $I(\beta) = X^{\mathsf T} W X$ and $W$ is the diagonal matrix of weights from the final IRLS iteration. Wald tests for hypotheses like $H_0: \beta_j = 0$ use the statistic $z = \hat{\beta}_j / \mathrm{se}(\hat{\beta}_j)$, with standard errors $\mathrm{se}(\hat{\beta}_j)$ from the diagonal of $I(\hat{\beta})^{-1}$, approximating a standard normal under the null. Likelihood ratio tests for nested models compare $-2[\ell(\hat{\beta}_{\text{reduced}}) - \ell(\hat{\beta}_{\text{full}})] \sim \chi^2_{\Delta p}$, where $\Delta p$ is the difference in parameters.

For model selection and assessment, the deviance $D = 2[\ell(\text{saturated}) - \ell(\text{fitted})]$ quantifies the discrepancy between the fitted model and a saturated model (one parameter per observation), following approximately $\chi^2_{n-p}$ for large samples under correct specification. The Akaike information criterion extends this for comparing non-nested models via $AIC = -2\ell(\hat{\beta}) + 2p$, penalizing complexity to favor parsimonious fits with good predictive accuracy. Overdispersion, where observed variability exceeds that implied by the model (e.g., variance greater than the mean for Poisson), is detected by checking whether the scaled deviance $\hat{\phi} = D / (n - p) \approx 1$; values $\hat{\phi} > 1$ indicate overdispersion, prompting scale adjustments or alternative distributions.

Bayesian Estimation

Bayesian estimation for generalized linear models (GLMs) treats the model parameters, such as the regression coefficients $\beta$, as random variables and computes their posterior distribution given the observed data $y$. The posterior density is given by

$$ p(\beta \mid y) \propto L(\beta \mid y)\, \pi(\beta), $$

where $L(\beta \mid y)$ is the likelihood derived from the exponential family distribution of the response, and $\pi(\beta)$ is the prior distribution on the coefficients. This approach incorporates prior information or regularization directly into the inference, contrasting with maximum likelihood estimation by yielding a full posterior distribution rather than point estimates.

For GLMs based on exponential family distributions, conjugate priors facilitate closed-form or analytically tractable posteriors. A class of such priors, proposed by Chen and Ibrahim, specifies a multivariate normal prior on $\beta$ conditional on hyperparameters, leading to a posterior that maintains conjugacy and allows straightforward elicitation based on predictive distributions for the mean response. For the normal GLM (e.g., linear regression), a normal prior on $\beta$ is conjugate, resulting in a normal posterior when combined with the Gaussian likelihood. These priors are particularly useful in low-data scenarios, as they enable exact inference without approximation.

In cases where priors are non-conjugate, such as for logistic or Poisson GLMs with complex priors, Markov chain Monte Carlo (MCMC) methods are employed to sample from the posterior. Gibbs sampling, implemented in software like JAGS, iteratively draws from conditional posteriors for each parameter. For more efficient exploration of high-dimensional posteriors, Hamiltonian Monte Carlo (HMC), as used in Stan, leverages gradient information to propose distant moves with high acceptance rates, often via the No-U-Turn Sampler variant. These methods enable reliable posterior summaries even for non-standard GLMs.

Bayesian estimation offers several advantages over frequentist approaches, including the ability to incorporate uncertainty in the dispersion parameter $\phi$ (e.g., overdispersion in Poisson models) directly into the posterior, and to perform model averaging by placing priors over multiple candidate models. Credible intervals, derived from the posterior quantiles, provide direct probabilistic interpretations of parameter uncertainty, unlike confidence intervals which rely on asymptotic approximations. In hierarchical Bayesian GLMs, random effects for clustered data are modeled as latent variables with hyperpriors, allowing information sharing across groups to improve estimates; fuller treatments appear in extensions for correlated data.

Key Examples

General Linear Models

The general linear model represents a foundational special case of the broader generalized linear model framework, unifying classical approaches to regression and analysis of variance under a normal distribution for the response variable and an identity link function. In this configuration, the expected value of the response directly equals the linear predictor, enabling straightforward modeling of continuous outcomes assumed to be normally distributed. This model underpins much of traditional statistical inference for linear relationships, providing the basis from which more flexible generalizations extend to non-normal data.

The mathematical form of the general linear model is given by

$$ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, $$

where $\mathbf{Y}$ is the $n \times 1$ vector of observations, $\mathbf{X}$ is the $n \times p$ design matrix of predictors, $\boldsymbol{\beta}$ is the $p \times 1$ vector of unknown parameters, and $\boldsymbol{\epsilon}$ is the $n \times 1$ error vector distributed as $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$, with $\sigma^2$ denoting the constant error variance and $\mathbf{I}$ the identity matrix. The identity link ensures that the mean $\boldsymbol{\mu} = \mathbf{X}\boldsymbol{\beta}$, aligning the linear predictor directly with the response scale. Parameter estimation typically employs ordinary least squares (OLS), which minimizes the sum of squared residuals and yields the estimator

$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^{\mathsf T} \mathbf{X})^{-1} \mathbf{X}^{\mathsf T} \mathbf{y}. $$

Under the normality assumption, this OLS estimator is equivalent to the maximum likelihood estimator, ensuring optimal properties such as unbiasedness and minimum variance among unbiased estimators.

The validity of inference in the general linear model hinges on three core assumptions: normality of the errors, homoscedasticity (constant variance across levels of the predictors), and independence of observations. Normality implies that $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$, supporting exact tests and confidence intervals; violations can lead to biased standard errors or invalid hypothesis tests. Homoscedasticity requires $\mathrm{Var}(\epsilon_i) = \sigma^2$ for all $i$, while independence assumes no correlation among errors, often justified by random sampling. Analysis of variance (ANOVA) arises as a particular instance of this model when predictors are categorical, partitioning total variance into components attributable to factors and interactions, thus testing group mean differences equivalently to multiple regression on dummy variables.

This classical setup transitions seamlessly into the generalized linear model paradigm when the normality assumption does not hold, such as for bounded or count data; here, the linear predictor $\eta = \mathbf{X}\boldsymbol{\beta}$ is retained, but paired with an alternative distribution from the exponential family and a suitable link function $g(\mu) = \eta$ to accommodate the response's scale and variance structure.

Linear Regression

Linear regression represents a fundamental application of the generalized linear model framework to continuous response variables, where the response $Y$ is assumed to follow a normal distribution with constant variance, and the identity link function is employed. In this setup, the conditional mean of the response is directly specified as $\mu = X\beta$, where $X$ is the design matrix of predictors and $\beta$ is the vector of regression coefficients. This formulation allows the model to capture linear relationships between predictors and the expected response without transformation, making it suitable for modeling continuous outcomes such as heights, weights, or experimental measurements. As a special case of the general linear model, it emphasizes predictive modeling for normally distributed errors.

The interpretation of coefficients in linear regression is straightforward and intuitive within the GLM context. Each $\beta_j$ quantifies the expected change in the response $Y$ for a one-unit increase in the corresponding predictor $x_j$, while holding all other predictors constant. This additive structure facilitates clear causal or associative interpretation when assumptions hold, such as in econometric analyses of covariate effects on consumption or biological studies of environmental factors on growth rates. Confidence intervals and hypothesis tests for $\beta_j$ rely on the normality assumption to ensure valid inference.

Model diagnostics are essential for validating the assumptions of linearity, normality, homoscedasticity, and independence in linear regression. Studentized residuals, which scale raw residuals by their estimated standard errors, are plotted against fitted values or predictors to detect non-linearity, outliers, or heteroscedasticity; deviations from a horizontal band around zero indicate violations. Quantile-quantile (Q-Q) plots compare the ordered studentized residuals to theoretical quantiles from a normal distribution, with points aligning closely to the reference line supporting the normality assumption. Influence measures like DFBETAS assess the impact of individual observations on specific coefficients, with values exceeding 2 in absolute terms flagging potentially influential points that could distort estimates.

To address multicollinearity among predictors, the variance inflation factor (VIF) is computed for each $\beta_j$ as the ratio of its variance in the full model to the variance in a univariate regression; VIF values greater than 5 or 10 signal problematic collinearity that inflates standard errors and reduces interpretability. In the presence of heteroscedasticity, where residual variance varies with fitted values, robust standard errors provide consistent estimates of coefficient variability by adjusting the covariance matrix estimate to account for non-constant variance without altering the point estimates. These diagnostics collectively ensure the reliability of linear regression models for continuous responses.
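A small sketch of computing variance inflation factors by hand, regressing each predictor on the others (the function name is illustrative, and the collinear predictor is simulated):

```python
import numpy as np

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the other predictors.
def vif(X):
    out = []
    for j in range(X.shape[1]):
        xj = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(xj)), others])       # add intercept
        fitted = A @ np.linalg.lstsq(A, xj, rcond=None)[0]
        r2 = 1 - np.sum((xj - fitted) ** 2) / np.sum((xj - xj.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(8)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=500)   # nearly collinear with x1
x3 = rng.normal(size=500)
print(vif(np.column_stack([x1, x2, x3])))     # large VIFs for x1 and x2
```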

Logistic Regression for Binary Data

Logistic regression models binary response data within the generalized linear model framework, where the response variable follows a binomial distribution with a single trial (n = 1), equivalent to a Bernoulli distribution. In this setup, the expected value μ represents the probability of success p for each observation, and the logit link function transforms this probability to the linear predictor η via g(μ) = log(μ / (1 − μ)) = η = Xβ, where X is the design matrix and β are the coefficients. This link ensures that predicted probabilities remain between 0 and 1, making it suitable for classification tasks such as predicting disease presence or customer churn.

Estimation of the parameters β typically employs maximum likelihood via iteratively reweighted least squares (IRLS), which iteratively fits weighted linear regressions to approximate the binomial likelihood. The binomial log-likelihood for binary data is l(β) = Σ [y_i log(μ_i) + (1 − y_i) log(1 − μ_i)], maximized by updating weights based on the variance μ(1 − μ) and working residuals. Model fit is assessed using deviance, defined as D = 2 [l(saturated) − l(fitted)], which follows a chi-squared distribution under the null hypothesis of adequate fit; the Hosmer-Lemeshow test further evaluates calibration by grouping observations and comparing observed to expected frequencies across deciles of predicted risk.

Coefficients β_j are interpreted as the change in the log-odds of success for a one-unit increase in predictor x_j, holding others constant; the odds ratio exp(β_j) quantifies the multiplicative change in odds. Predicted probabilities are obtained by applying the inverse logit (logistic) function σ(η) = 1 / (1 + exp(−η)) to the linear predictor, enabling direct probability estimates for decision-making in applications like diagnostics.

Alternative link functions for binary GLMs include the probit link g(μ) = Φ^{-1}(μ) = η, where Φ^{-1} is the inverse of the standard normal cumulative distribution function, often chosen for its latent variable interpretation assuming an underlying normal threshold model. The complementary log-log link g(μ) = log(−log(1 − μ)) = η suits asymmetric response probabilities and aligns with extreme value distributions in survival contexts. The choice among logit, probit, and cloglog links depends on theoretical motivations, with the logit often preferred for its direct odds-ratio interpretation and computational stability in practice.

Poisson Regression for Count Data

Poisson regression is a specific type of generalized linear model used to analyze count data, where the response variable represents the number of events occurring in a fixed interval of time or space. The model assumes that the counts follow a Poisson distribution, which is part of the exponential family, with the mean equal to the variance. In this setup, the linear predictor η is linked to the mean μ via the canonical log link function, g(μ) = log(μ) = η = Xβ, where X is the design matrix and β are the regression coefficients. The baseline rate is given by exp(β₀), representing the expected count when all predictors are zero.

The coefficients βⱼ in Poisson regression are interpreted as log-rate ratios; a unit increase in predictor xⱼ multiplies the expected count by exp(βⱼ), holding other variables constant. To account for varying exposure levels, such as different observation times or population sizes, an offset term log(E) is included in the linear predictor, where E is the exposure; this adjusts the model to estimate rates rather than absolute counts. For example, in modeling disease incidence, log(time) serves as the offset to normalize counts by observation period.

A key assumption of the Poisson model is equidispersion, where the variance equals the mean (Var(Y) = μ). Overdispersion occurs when the variance exceeds the mean (Var(Y) > μ), often due to unobserved heterogeneity, leading to underestimated standard errors in standard Poisson regression. One approach is the quasi-Poisson model, which retains the Poisson structure but allows variance Var(Y) = φμ, where φ > 1 is a dispersion parameter estimated from the data; this was introduced to handle mild overdispersion without altering the mean structure. An alternative is the negative binomial regression model, which explicitly models overdispersion via Var(Y) = μ + αμ², where α ≥ 0 is a dispersion parameter; when α = 0, it reduces to the Poisson model.

Model diagnostics for Poisson regression include Pearson residuals, defined as rᵢ = (yᵢ − μᵢ) / √μᵢ, which standardize deviations for assessing fit and influence; these residuals should approximate a standard normal distribution under the model. Rootograms provide a graphical diagnostic by plotting the square roots of observed and fitted frequencies against count values, helping visualize deviations in the distribution shape, such as overdispersion or poor fit at specific counts. For data with excess zeros beyond Poisson expectations, zero-inflated Poisson models extend the framework by incorporating a separate process for structural zeros, typically using a logit link for the probability of zero and a Poisson component for positive counts.
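A hedged sketch of a Poisson rate model with an exposure offset, using statsmodels on simulated data (variable names and true parameters are arbitrary):

```python
import numpy as np
import statsmodels.api as sm

# Poisson regression of event counts with a log(exposure) offset, so that
# coefficients describe rates per unit of exposure rather than raw counts.
rng = np.random.default_rng(9)
n = 800
x = rng.normal(size=n)
exposure = rng.uniform(0.5, 5.0, size=n)              # e.g. person-years observed
rate = np.exp(-1.0 + 0.6 * x)                         # events per unit exposure
y = rng.poisson(rate * exposure)

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson(),
             offset=np.log(exposure)).fit()
print(fit.params)            # close to (-1.0, 0.6)
print(np.exp(fit.params[1])) # rate ratio per unit increase in x
```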

Multinomial Regression

Multinomial regression within the framework of generalized linear models addresses response variables that are categorical with more than two possible outcomes, encompassing both unordered (nominal) and ordered (ordinal) cases. For nominal responses, the model predicts the probability of each category relative to a baseline, while for ordinal responses, it accounts for the inherent ordering by modeling cumulative probabilities. This approach builds on the binomial case, using the multinomial distribution for the response and an appropriate link function to connect the linear predictor to the probabilities.

In the case of unordered categories, the multinomial logit model is employed, with one category designated as the baseline for comparison. The probability of observing category $j$ given covariates $X$ is given by

$$ P(Y = j) = \frac{\exp(\eta_j)}{\sum_k \exp(\eta_k)}, $$

where $\eta_j = X \beta_j$ for $j = 1, \dots, J-1$, and the baseline category has $\eta_J = 0$. This formulation ensures the probabilities sum to 1 across all $J$ categories.

For ordered categories, the cumulative logit model is typically used, which applies the logit link to the cumulative probabilities $P(Y \leq j)$. Under the proportional odds assumption, the log-odds ratios are constant across cumulative cut points, expressed as

$$ \log \left( \frac{P(Y \leq j)}{1 - P(Y \leq j)} \right) = \alpha_j - X \beta, $$

where $\beta$ represents common coefficients for the covariates. A probit link can alternatively be applied for the cumulative model, though the logit is more common.

Estimation of parameters in multinomial regression proceeds via maximum likelihood, where the log-likelihood is constructed from the multinomial probabilities derived through the softmax function, equivalent to the expression above for $P(Y = j)$. The softmax ensures normalized outputs suitable for probabilistic interpretation during optimization. For inference on category-specific effects, Wald tests compare coefficients across outcomes, assessing whether predictors differ significantly in their impact on category probabilities relative to the baseline.

Interpretation focuses on the exponentiated coefficients for practical insight. In the multinomial logit model, $\exp(\beta_{jk})$ yields relative risk ratios (or more precisely, odds ratios) indicating the change in odds of category $j$ versus the baseline for a one-unit increase in covariate $k$, holding others constant. For the ordered cumulative logit, $\exp(\beta_k)$ represents the common odds ratio across thresholds, quantifying how covariates shift the cumulative probabilities and thus the location on the ordinal scale. These interpretations emphasize relative effects, aiding in understanding predictor influences on category selection or ordering. This setup generalizes binary logistic regression to multiple categories, reducing to the binary case when $J = 2$.
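A short sketch of the softmax expression above, mapping linear predictors for J = 3 categories (with the last as baseline) to probabilities that sum to one (illustrative function and values):

```python
import numpy as np

# Convert linear predictors into multinomial category probabilities with the
# softmax expression, using category J as the baseline (eta_J = 0).
def softmax_probs(X, B):
    """X: (n, p) covariates; B: (p, J-1) coefficients for the non-baseline categories."""
    eta = np.column_stack([X @ B, np.zeros(X.shape[0])])   # append eta_J = 0
    eta -= eta.max(axis=1, keepdims=True)                   # numerical stability
    expd = np.exp(eta)
    return expd / expd.sum(axis=1, keepdims=True)

X = np.array([[1.0, 0.2], [1.0, -1.5]])        # intercept + one covariate, two observations
B = np.array([[0.5, -0.3], [1.2, 0.8]])        # 3 categories -> 2 coefficient columns
P = softmax_probs(X, B)
print(P, P.sum(axis=1))                         # rows are probabilities summing to 1
```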

Extensions and Advanced Topics

Generalized Additive Models

Generalized additive models (GAMs) extend generalized linear models (GLMs) by allowing non-linear relationships between predictors and the linear predictor through the use of smooth functions, while preserving the response distribution and link function of the GLM framework. Introduced by Hastie and Tibshirani in the 1980s, GAMs replace the linear combination of predictors in the standard GLM with an additive sum of smooth functions, enabling flexible modeling of complex data patterns without assuming a specific parametric form for each covariate effect. This approach is particularly useful in exploratory analysis and in situations where relationships are suspected to be non-linear but additive across variables.

The core form of a GAM is
$$\eta_i = \beta_0 + \sum_{j=1}^p f_j(x_{ij}),$$
where $\eta_i$ is the linear predictor for the $i$-th observation, $\beta_0$ is an intercept, $x_{ij}$ is the value of the $j$-th predictor for the $i$-th observation, and the $f_j$ are unspecified smooth functions, often represented by splines or other basis expansions. If each $f_j$ is linear, the GAM reduces to a standard GLM, maintaining the connection to the original framework. Common implementations, such as those in the R package mgcv, employ penalized splines for the smooth functions $f_j$, balancing fit and smoothness via penalties on the second derivatives.

Estimation in GAMs typically involves iterative methods such as backfitting, which alternately fits univariate smooths to partial residuals while holding the other terms fixed, or penalized likelihood maximization, which incorporates smoothing penalties directly into the log-likelihood. The smoothness of each $f_j$ is controlled by parameters estimated via generalized cross-validation or similar criteria, with the effective degrees of freedom providing a measure of model complexity analogous to parametric degrees of freedom in GLMs. These methods keep the resulting smooths interpretable and guard against overfitting, and the additive structure facilitates partial-dependence-style visualizations of individual predictor effects.

The primary advantages of GAMs lie in their ability to capture non-linearities using a modest number of terms, improving predictive accuracy and model fit over rigid parametric assumptions without the interpretability loss of fully non-parametric methods. By visualizing the estimated smooth functions $f_j$, users can gain insight into the shape of relationships, such as monotonic increases or humps, which aids scientific interpretation and hypothesis generation. This flexibility has made GAMs widely adopted in fields such as ecology and epidemiology for modeling responses such as species abundance or disease risk.
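The text above mentions mgcv in R; the sketch below instead uses statsmodels' GLMGam with a B-spline basis to show the same idea in Python on simulated data. It is a minimal sketch: the basis dimension df=8, spline degree 3, and penalty weight alpha=1.0 are illustrative choices, not recommendations from the text.

```python
# Minimal sketch: a binomial GAM (logit link) with one penalized smooth term
# and one parametric term, fitted with statsmodels' GLMGam on simulated data.
import numpy as np
import statsmodels.api as sm
from statsmodels.gam.api import GLMGam, BSplines

rng = np.random.default_rng(2)
n = 400
x1 = rng.uniform(-2, 2, size=n)   # enters non-linearly
x2 = rng.uniform(0, 1, size=n)    # enters linearly

# True relationship: smooth in x1, linear in x2
eta = np.sin(1.5 * x1) + 1.0 * x2 - 0.2
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

# B-spline basis for the smooth f(x1); df and degree are tuning choices
splines = BSplines(x1[:, None], df=[8], degree=[3])

# Parametric part (intercept + linear x2) plus the penalized smooth of x1
X = sm.add_constant(x2)
gam = GLMGam(y, exog=X, smoother=splines, alpha=[1.0],
             family=sm.families.Binomial())
res = gam.fit()
print(res.summary())

# res.plot_partial(0, cpr=True)   # partial-effect plot of f(x1); needs matplotlib
```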

Models for Correlated Data

Generalized linear models assume that observations are independent, but many real-world datasets exhibit correlation due to clustering, repeated measures, or hierarchical structures, such as patients followed over time or schools with multiple students. This violates the independence assumption, leading to biased standard errors and invalid inference if unaddressed. Extensions to GLMs for correlated data incorporate dependence structures while maintaining the exponential-family form for the response and the link function for the mean. These methods are essential in fields such as biostatistics, epidemiology, and the social sciences, where data naturally arise in groups.

Generalized estimating equations (GEE) provide a marginal approach to handling correlated data without fully specifying the joint distribution. Proposed by Liang and Zeger (1986), GEE extends the score equations of standard GLMs by including a working correlation matrix to account for within-cluster dependencies; even a simple working-independence structure yields usable parameter estimates. The estimating equations are solved iteratively to obtain consistent estimates of the regression coefficients, even under a misspecified working correlation. To ensure valid inference, a robust variance estimator, known as the sandwich estimator, adjusts for the actual correlation structure empirically:
$$\hat{V} = \hat{I}^{-1} \hat{B} \hat{I}^{-1},$$
where $\hat{I}$ is the model-based information matrix and $\hat{B}$ captures the empirical variance of the score contributions. This approach yields population-averaged effects: coefficients are interpreted as average changes across the population rather than for specific subjects.

In contrast, generalized linear mixed models (GLMMs) explicitly model correlation through random effects included in the linear predictor. Breslow and Clayton (1993) formalized GLMMs by augmenting the fixed-effects linear predictor $\eta = X\beta$ with random effects $u \sim N(0, G)$, where $G$ is a covariance matrix specifying the dependence, such as random intercepts for clustering or random slopes for varying trajectories. The marginal likelihood is obtained by integrating over the random-effects distribution, often approximated via Laplace methods or numerical quadrature to handle the intractable integral:
$$L(\beta, \theta) = \int \prod_i f(y_i \mid X_i \beta + Z_i u; \phi) \, g(u; G(\theta)) \, du,$$
with $\theta$ parameterizing $G$ and $\phi$ the dispersion. This framework produces subject-specific interpretations, where fixed effects represent effects conditional on the random effects for an individual cluster. Model-based standard errors rely on correct specification of both the mean and the random-effects structure.

Practical applications illustrate these methods' utility. For longitudinal binary data, such as respiratory status measured repeatedly over clinic visits, GEE with a first-order autoregressive (AR(1)) working correlation captures the decaying dependence between consecutive observations, improving efficiency over a working-independence assumption while retaining robustness. For clustered count data, such as the number of epileptic seizures per patient across treatment periods, a Poisson GLMM with random intercepts models extra-Poisson variation due to patient-specific baselines, where the random effect $u_j \sim N(0, \sigma^2_u)$ induces positive correlation within clusters. These examples highlight how GEE prioritizes marginal effects suitable for population-level questions, whereas GLMMs support individualized predictions.

Inference in these models differs in interpretation and robustness. GEE delivers population-averaged (marginal) effects, averaging over the cluster distribution, which aligns with unconditional population summaries. GLMMs, by contrast, provide subject-specific (conditional) effects, conditioning on the realized random effects, which gives deeper mechanistic insight. Standard errors in GEE are robust to misspecification of the working correlation via the sandwich form, whereas GLMM standard errors are model-based and sensitive to the random-effects assumptions, though bootstrap alternatives can improve reliability. The choice between approaches depends on the scientific goal: population-level summaries favor GEE, while interest in heterogeneity across units suits GLMMs.
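A minimal sketch of the GEE side of this comparison, assuming simulated clustered counts and statsmodels' GEE implementation with an exchangeable working correlation (simpler than the AR(1) structure mentioned above); it contrasts the robust sandwich standard errors with those of a naive Poisson GLM that ignores clustering. The cluster sizes and simulated effects are illustrative assumptions.

```python
# Minimal sketch: GEE for clustered Poisson counts with robust (sandwich)
# standard errors, compared with a naive GLM that ignores clustering.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n_clusters, per_cluster = 60, 5
groups = np.repeat(np.arange(n_clusters), per_cluster)
x = rng.normal(size=n_clusters * per_cluster)

# Cluster-specific random intercepts induce within-cluster correlation
u = rng.normal(scale=0.6, size=n_clusters)
mu = np.exp(0.2 + 0.4 * x + u[groups])
y = rng.poisson(mu)

X = sm.add_constant(x)

# GEE with an exchangeable working correlation; coefficients are
# population-averaged, and the default covariance is the robust sandwich form
gee = sm.GEE(y, X, groups=groups, family=sm.families.Poisson(),
             cov_struct=sm.cov_struct.Exchangeable())
gee_res = gee.fit()
print(gee_res.summary())

# A naive Poisson GLM that ignores clustering typically reports standard
# errors that are too small for the between-cluster predictor effects
glm_res = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print("GEE robust SE:", gee_res.bse[1], "  naive GLM SE:", glm_res.bse[1])
```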
