Log-linear analysis
from Wikipedia

Log-linear analysis is a technique used in statistics to examine the relationship between more than two categorical variables. The technique is used for both hypothesis testing and model building. In both these uses, models are tested to find the most parsimonious (i.e., least complex) model that best accounts for the variance in the observed frequencies. (A Pearson's chi-square test could be used instead of log-linear analysis, but that technique only allows for two of the variables to be compared at a time.[1])

Fitting criterion

Log-linear analysis uses a likelihood ratio statistic that has an approximate chi-square distribution when the sample size is large:[2]

$$G^2 = 2 \sum_{ij} O_{ij} \ln\left(\frac{O_{ij}}{E_{ij}}\right)$$

where

$\ln$ = natural logarithm;
$O_{ij}$ = observed frequency in cell $ij$ ($i$ = row and $j$ = column);
$E_{ij}$ = expected frequency in cell $ij$;
$G^2$ = the deviance for the model.[3]
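The statistic can be computed directly from observed and expected counts. Below is a minimal sketch in Python, assuming NumPy and SciPy are available, using a hypothetical 2 × 3 table with expected frequencies taken from the independence model:

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([[30.0, 15.0, 5.0],
                     [20.0, 25.0, 25.0]])   # hypothetical 2x3 counts

# Expected frequencies under independence: row total x column total / grand total
row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
expected = row_tot * col_tot / observed.sum()

# Likelihood ratio (deviance) statistic: 2 * sum of O_ij * ln(O_ij / E_ij)
G2 = 2.0 * np.sum(observed * np.log(observed / expected))
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(G2, df, chi2.sf(G2, df))   # statistic, degrees of freedom, p-value
```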

Assumptions

There are three assumptions in log-linear analysis:[2]

1. The observations are independent and random;

2. Observed frequencies are normally distributed about expected frequencies over repeated samples. This is a good approximation if both (a) the expected frequencies are greater than or equal to 5 for 80% or more of the categories and (b) all expected frequencies are greater than 1; a simple programmatic check of this rule is sketched after this list. Violations of this assumption result in a large reduction in power. Suggested solutions to this violation are: delete a variable, combine levels of one variable (e.g., put males and females together), or collect more data.

3. The logarithm of the expected value of the response variable is a linear combination of the explanatory variables. This assumption is so fundamental that it is rarely mentioned, but like most linearity assumptions, it is rarely exact and often simply made to obtain a tractable model.

Additionally, data should always be categorical. Continuous data can first be converted to categorical data, with some loss of information. With both continuous and categorical data, it would be best to use logistic regression. (Any data that is analysed with log-linear analysis can also be analysed with logistic regression. The technique chosen depends on the research questions.)
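The expected-frequency rule in assumption 2 can be checked programmatically. A minimal sketch, assuming NumPy and using the thresholds quoted above (the function name and example counts are hypothetical):

```python
import numpy as np

def check_expected_frequencies(expected):
    """Return True if >= 80% of expected counts are >= 5 and all exceed 1."""
    expected = np.asarray(expected, dtype=float)
    share_ge_5 = np.mean(expected >= 5)   # fraction of cells with E >= 5
    all_gt_1 = np.all(expected > 1)       # no expected frequency of 1 or less
    return share_ge_5 >= 0.80 and all_gt_1

print(check_expected_frequencies([[12.3, 6.1], [5.2, 8.9]]))   # True
print(check_expected_frequencies([[12.3, 6.1], [0.9, 8.9]]))   # False
```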

Variables

In log-linear analysis there is no clear distinction between independent and dependent variables; all variables are treated the same. However, the theoretical background of the study often leads some variables to be interpreted as independent and others as dependent.[1]

Models

The goal of log-linear analysis is to determine which model components need to be retained in order to best account for the data. Model components are the main effects and interactions in the model. For example, if we examine the relationship between three variables—variable A, variable B, and variable C—there are seven model components in the saturated model: the three main effects (A, B, C), the three two-way interactions (AB, AC, BC), and the one three-way interaction (ABC).

Log-linear models can be thought of as lying on a continuum with the two extremes being the simplest model and the saturated model. The simplest model is the model in which all the expected frequencies are equal. This is true when the variables are not related. The saturated model is the model that includes all the model components. This model will always explain the data the best, but it is the least parsimonious as everything is included. In this model, observed frequencies equal expected frequencies, so in the likelihood ratio chi-square statistic the ratio $O_{ij}/E_{ij} = 1$ and $\ln(O_{ij}/E_{ij}) = 0$. This results in the likelihood ratio chi-square statistic being equal to 0, which is the best model fit.[2] Other possible models are the conditional equiprobability model and the mutual dependence model.[1]

Each log-linear model can be represented as a log-linear equation. For example, with the three variables (A, B, C) the saturated model has the following log-linear equation:[1]

$$\ln(F_{ijk}) = \lambda + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \lambda_{jk}^{BC} + \lambda_{ijk}^{ABC}$$

where

$F_{ijk}$ = expected frequency in cell $ijk$;
$\lambda$ = the relative weight of each variable.
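A saturated model can be fitted as a Poisson regression on the cell counts, in which case the fitted frequencies reproduce the observed frequencies exactly. A minimal sketch, assuming pandas and statsmodels are available and using hypothetical counts for a 2 × 2 × 2 table:

```python
import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
cells = list(itertools.product("12", "12", "12"))      # a 2x2x2 table
df = pd.DataFrame(cells, columns=["A", "B", "C"])
df["count"] = rng.integers(5, 40, size=len(df))        # hypothetical cell counts

# C(A) * C(B) * C(C) expands to all main effects and interactions (the saturated model).
saturated = smf.glm("count ~ C(A) * C(B) * C(C)", data=df,
                    family=sm.families.Poisson()).fit()

print(np.allclose(saturated.fittedvalues, df["count"]))   # True: fitted equals observed
print(round(saturated.deviance, 8))                       # 0.0: likelihood ratio statistic is zero
```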

Hierarchical model

Log-linear analysis models can be hierarchical or nonhierarchical. Hierarchical models are the most common. These models contain all the lower-order interactions and main effects implied by the interaction being examined.[1]

Graphical model

A log-linear model is graphical if, whenever the model contains all two-factor terms generated by a higher-order interaction, the model also contains the higher-order interaction.[4] As a direct consequence, graphical models are hierarchical. Moreover, being completely determined by its two-factor terms, a graphical model can be represented by an undirected graph, where the vertices represent the variables and the edges represent the two-factor terms included in the model.

Decomposable model

A log-linear model is decomposable if it is graphical and if the corresponding graph is chordal.

Model fit

The model fits well when the residuals (i.e., observed minus expected frequencies) are close to 0; that is, the closer the observed frequencies are to the expected frequencies, the better the model fit. If the likelihood ratio chi-square statistic is non-significant, the model fits well (i.e., the calculated expected frequencies are close to the observed frequencies). If the likelihood ratio chi-square statistic is significant, the model does not fit well (i.e., the calculated expected frequencies are not close to the observed frequencies).

Backward elimination is used to determine which of the model components are necessary to retain in order to best account for the data. Log-linear analysis starts with the saturated model, and the highest-order interactions are removed until the model no longer accurately fits the data. Specifically, at each stage, after the removal of the highest-order interaction, the likelihood ratio chi-square statistic is computed to measure how well the model fits the data. Interactions are no longer removed when the likelihood ratio chi-square statistic becomes significant.[2]
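One step of this backward-elimination procedure can be sketched by fitting the saturated model and the model without the highest-order interaction, then testing the change in the likelihood ratio statistic. The example below is a sketch with hypothetical counts, assuming pandas, statsmodels, and SciPy are available:

```python
import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Hypothetical 2x2x2 table of counts, one row per cell.
rng = np.random.default_rng(1)
df = pd.DataFrame(list(itertools.product("12", "12", "12")), columns=["A", "B", "C"])
df["count"] = rng.integers(10, 80, size=len(df))

# One backward-elimination step: drop the highest-order (three-way) interaction.
saturated = smf.glm("count ~ C(A) * C(B) * C(C)", data=df,
                    family=sm.families.Poisson()).fit()
no_three_way = smf.glm("count ~ (C(A) + C(B) + C(C)) ** 2", data=df,
                       family=sm.families.Poisson()).fit()

delta_G2 = no_three_way.deviance - saturated.deviance   # increase in G^2
delta_df = no_three_way.df_resid - saturated.df_resid   # degrees of freedom freed
p_value = chi2.sf(delta_G2, delta_df)

# A non-significant p-value means the three-way term can be dropped and the
# procedure continues with the two-way interactions; a significant one stops it.
print(delta_G2, delta_df, p_value)
```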

Comparing models

When two models are nested, they can be compared using a chi-square difference test. The chi-square difference is computed by subtracting the likelihood ratio chi-square statistics of the two models, and it is compared to the chi-square critical value at the difference in their degrees of freedom. If the chi-square difference is smaller than the critical value, the more parsimonious model does not fit significantly worse and is the preferred model. If the chi-square difference is larger than the critical value, the less parsimonious model is preferred.[1]
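A minimal numerical sketch of the difference test, using made-up likelihood ratio statistics and degrees of freedom for two hypothetical nested models (SciPy assumed):

```python
from scipy.stats import chi2

# Hypothetical G^2 values and degrees of freedom for two nested models.
G2_reduced, df_reduced = 12.4, 7      # more parsimonious model
G2_full, df_full = 5.1, 4             # model with extra terms

delta_G2 = G2_reduced - G2_full       # 7.3
delta_df = df_reduced - df_full       # 3
critical = chi2.ppf(0.95, delta_df)   # about 7.81 at alpha = 0.05

print(delta_G2 < critical)            # True: the simpler model is preferred here
print(chi2.sf(delta_G2, delta_df))    # p-value of the difference test
```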

Follow-up tests

Once the model of best fit is determined, the highest-order interaction is examined by conducting chi-square analyses at different levels of one of the variables. To conduct chi-square analyses, one needs to break the model down into a 2 × 2 or 2 × 1 contingency table.[2]

For example, if one is examining the relationship among four variables, and the model of best fit contained one of the three-way interactions, one would examine its simple two-way interactions at different levels of the third variable.
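A minimal sketch of such follow-up tests, assuming SciPy and a hypothetical 2 × 2 × 2 table in which each slice is the two-way subtable at one level of the third variable:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical table: axis 0 indexes the levels of variable C,
# and each slice is the 2x2 A-by-B subtable at that level.
table = np.array([[[20, 10],
                   [15, 25]],
                  [[30, 35],
                   [10, 40]]])

for k in range(table.shape[0]):
    chi2_stat, p, dof, expected = chi2_contingency(table[k])
    print(f"C = level {k}: chi2 = {chi2_stat:.2f}, df = {dof}, p = {p:.4f}")
```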

Effect sizes

To compare effect sizes of the interactions between the variables, odds ratios are used; a worked computation is sketched after the list below. Odds ratios are preferred over chi-square statistics for two main reasons:[1]

1. Odds ratios are independent of the sample size;

2. Odds ratios are not affected by unequal marginal distributions.
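A minimal sketch of the odds-ratio computation for a hypothetical 2 × 2 subtable (the Wald-type confidence interval is a standard addition, not something discussed above):

```python
import numpy as np

# Hypothetical 2x2 subtable of counts [[a, b], [c, d]].
a, b, c, d = 40.0, 10.0, 20.0, 30.0

odds_ratio = (a * d) / (b * c)                      # (40*30)/(10*20) = 6.0
se_log_or = np.sqrt(1/a + 1/b + 1/c + 1/d)          # standard error of log(OR)
ci = np.exp(np.log(odds_ratio) + np.array([-1.96, 1.96]) * se_log_or)

print(odds_ratio)   # effect size, unaffected by the overall sample size
print(ci)           # approximate 95% confidence interval
```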

from Grokipedia
Log-linear analysis is a statistical technique used to model associations and interactions among categorical variables through the analysis of contingency tables, where the logarithm of the expected cell frequencies is expressed as a linear combination of parameters representing main effects and interactions. This approach treats the observed counts as realizations of a Poisson or multinomial sampling process, enabling the use of generalized linear models to test hypotheses such as independence or conditional independence in multi-way tables. Developed in the 1960s as a multiplicative extension of earlier categorical methods, log-linear analysis provides a flexible framework for hierarchical model building and goodness-of-fit assessment via likelihood ratio tests.

The method's core formulation posits that for a two-way contingency table with dimensions $I \times J$, the log-expected frequency in cell $(i, j)$ follows $\log(\mu_{ij}) = \lambda + \lambda^X_i + \lambda^Y_j + \lambda^{XY}_{ij}$, where $\lambda$ represents the overall mean, $\lambda^X_i$ and $\lambda^Y_j$ are main effects, and $\lambda^{XY}_{ij}$ captures the association; exponentiating parameters yields odds ratios for interpreting effect sizes. For higher-dimensional tables, additional interaction terms are included hierarchically, allowing examination of complex relationships like partial associations in three-way designs. Model estimation typically employs maximum likelihood via iterative proportional fitting or Newton-Raphson algorithms, with challenges such as structural zeros addressed through specialized algebraic techniques.

Historically, log-linear models built on foundational work in contingency table analysis from the early 20th century, including Pearson's chi-square test (1900) and Bartlett's maximum likelihood approaches (1935), but gained prominence through Birch's 1963 formalization of multiplicative models using u-term expansions. Widely applied in fields such as the social sciences, log-linear analysis can handle sparse data and ordinal variables—though sparse cases pose estimation challenges—offering an alternative to logistic regression when all variables are treated symmetrically as predictors. Its integration with graphical models and extensions to random effects has further enhanced its utility in modern categorical data analysis.

Fundamentals

Definition and Purpose

Log-linear analysis is a statistical technique used to model the relationships among multiple categorical variables through the analysis of multi-dimensional contingency tables. It expresses the logarithm of the expected cell frequencies as a linear combination of parameters that capture main effects and interactions among the variables. Formally, the model is given by

$$\log(\mu_i) = \beta_0 + \sum_j \beta_j x_{ij} + \sum_{j<k} \beta_{jk} x_{ij} x_{ik} + \cdots,$$

where $\mu_i$ denotes the expected frequency in cell $i$, $\beta_0$ is the intercept, the $\beta_j$ represent main effects, the $\beta_{jk}$ capture two-way interactions, and higher-order terms account for more complex associations, with $x_{ij}$ as indicator variables for the categories of variable $j$.

The primary purpose of log-linear analysis is to identify and test for associations, partial independence, and higher-order interactions in cross-classified categorical data, especially when traditional methods that assume continuous responses are inappropriate due to the discrete nature of the variables. It enables researchers to assess whether variables are independent or exhibit dependencies that cannot be explained by lower-order effects alone, providing insights into the structure of multivariate frequency distributions. Unlike logistic regression, which models the conditional distribution of one categorical response variable given the others, log-linear analysis models the joint distribution of all variables simultaneously, treating none as distinctly dependent or independent. This approach was developed in the 1960s, with key formalizations such as Birch's 1963 work on multiplicative models, building on foundational work in multivariate categorical analysis.
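Because the model treats the cell counts as the response, it can be fitted as a Poisson regression on an expanded table with one row per cell. A minimal sketch, assuming pandas, statsmodels, and SciPy and using hypothetical counts, showing that the deviance of the independence model equals the likelihood ratio test of independence:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2_contingency

counts = np.array([[30, 15, 5],
                   [20, 25, 25]])                     # hypothetical 2x3 counts

# One row per cell: the two classifying variables plus the observed count.
df = pd.DataFrame(
    [(i, j, counts[i, j]) for i in range(2) for j in range(3)],
    columns=["X", "Y", "count"],
)

# Log-linear model of independence, fitted as a Poisson regression on the counts.
indep = smf.glm("count ~ C(X) + C(Y)", data=df,
                family=sm.families.Poisson()).fit()

# Its deviance equals the likelihood ratio (G^2) statistic for independence.
g2, p, dof, _ = chi2_contingency(counts, lambda_="log-likelihood")
print(round(indep.deviance, 6), round(g2, 6), dof)
```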

Key Assumptions

Log-linear analysis relies on several foundational statistical assumptions to ensure the validity of inferences drawn from categorical data in contingency tables. Under the Poisson sampling framework, a primary assumption is the independence of cell counts, meaning that the counts in different cells are independent. This independence ensures that the variability in one cell does not influence another, allowing the model to accurately capture associations among variables without confounding from correlated errors. Under multinomial sampling, cell counts are dependent due to the fixed total sample size, but the model accounts for this.

The sampling scheme underlying the data must also align with the model's structure, typically following either a Poisson distribution for individual cell counts—where the total sample size is not fixed—or a multinomial distribution when margins are fixed, such as in product-multinomial sampling. These distributions lead naturally to the log-linear parameterization because the logarithm of the expected cell frequencies is modeled as a linear function of the parameters, facilitating the analysis of cell counts as responses. Under the Poisson assumption, cell counts are independent Poisson random variables, while the multinomial variant conditions on fixed totals to model proportions.

Another key assumption is the absence of structural zeros in the contingency table, where all cells are presumed to have positive probability unless theoretically impossible due to the study's design. Structural zeros represent combinations of categories that cannot occur (e.g., impossible events), and their presence without explicit modeling can bias parameter estimates; thus, standard log-linear models assume such cells are either nonexistent or handled by excluding them from the table structure.

For reliable inference, particularly when using likelihood ratio tests or Pearson chi-square statistics to assess model fit, the data must satisfy a sufficient sample size requirement: expected frequencies should be at least 5 in the majority (typically 80%) of cells. This condition supports the asymptotic chi-square approximation of the deviance and ensures that parameter estimates are stable and tests have appropriate Type I error rates; smaller expected values may necessitate exact methods or collapsed tables.

Finally, the model incorporates an assumption of homogeneity of variance implicit in the Poisson sampling framework, where the variance of each cell count equals its expected value (Var(Y) = E(Y) = μ). This equidispersion property is crucial for the maximum likelihood estimation and standard error calculations in log-linear models; violations, such as overdispersion, may require extensions like negative binomial variants.

Types of Variables

Log-linear analysis primarily involves categorical variables, which are divided into nominal and ordinal types based on whether they possess an inherent order. Nominal variables represent unordered categories, such as gender (male/female) or religious affiliation (Protestant/Catholic/other), where the levels have no natural ranking and are treated as distinct groups without implying superiority or progression. These variables are typically encoded using dummy indicator variables, each corresponding to a category except a reference level, to facilitate modeling in the log-linear framework. Ordinal variables, however, feature ordered categories, for example, education levels (elementary/secondary/college) or satisfaction ratings (low/medium/high), where the sequence reflects increasing or decreasing intensity; they can also be represented by indicators but may incorporate scoring to leverage the order for more efficient analysis.

The foundational data structure for log-linear analysis is the contingency table, a multi-way array that cross-classifies observations according to the levels of two or more categorical variables, with each cell containing the observed frequency count of occurrences for that combination. For example, a 2×2×3 contingency table might tabulate counts across two binary nominal variables and one ordinal variable with three levels, enabling the examination of joint distributions. The margins of these tables—row totals, column totals, or higher-dimensional sums—represent univariate marginal distributions or partial associations between subsets of variables, providing summaries of the data's structure before modeling interactions.

In contingency tables for log-linear models, the treatment of margins as fixed or random depends on the underlying sampling scheme, which influences the distributional assumptions. Under Poisson sampling, all margins are random, with cell counts modeled as independent Poisson random variables whose means equal their variances, suitable for independent count data without fixed totals. In multinomial sampling, the overall sample size (a one-dimensional margin) is fixed, rendering other margins random conditional on this total, which aligns with scenarios where observations are categorized into mutually exclusive cells summing to a known n. This distinction ensures that the model accounts for the dependencies introduced by fixed margins, such as in prospective studies where row totals are controlled.

Interaction terms in log-linear analysis describe the associations among categorical variables through main effects, two-way interactions, and higher-order terms, forming the core of how dependencies are modeled in contingency tables. Main effects capture univariate marginal distributions for individual variables, reflecting their standalone influences on cell frequencies. Two-way interaction terms model bivariate associations between pairs of variables, such as the joint distribution of gender and education level, indicating whether categories co-occur more or less than expected under independence. Higher-order interactions, like three-way terms, address conditional dependencies among three or more variables—for instance, how the association between two variables varies across levels of a third—allowing for the representation of complex, multifaceted relationships in multi-way tables.
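A minimal sketch of building such a contingency table and its one-way margins from hypothetical raw observations, assuming pandas is available (variable names are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 200
data = pd.DataFrame({
    "gender": rng.choice(["male", "female"], size=n),
    "education": rng.choice(["elementary", "secondary", "college"], size=n),
    "employed": rng.choice(["yes", "no"], size=n),
})

# Cross-classify the raw observations into a 2 x 3 x 2 table of counts.
table = pd.crosstab([data["gender"], data["education"]], data["employed"])
print(table)

# One-way margins (univariate marginal distributions).
print(data["gender"].value_counts())
print(data["education"].value_counts())
```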

Model Specification and Fitting

Fitting Criteria

In log-linear analysis, parameters $\beta$ are estimated using maximum likelihood estimation (MLE), which selects values that maximize the likelihood of the observed cell frequencies $n_i$ under the assumption that these frequencies follow a Poisson distribution with expected values $\mu_i = \exp(X_i \beta)$, where $X_i$ is the design vector for cell $i$. The Poisson likelihood function for the parameters $\beta$ given the observed counts $n = (n_1, \dots, n_I)$ is

$$L(\beta \mid n) \propto \prod_{i=1}^{I} \frac{\mu_i^{n_i} \exp(-\mu_i)}{n_i!},$$

where the product is over all $I$ cells in the contingency table, and the factorial terms are constants with respect to $\beta$. Maximizing this likelihood, or equivalently its logarithm, requires solving nonlinear equations, typically through iterative numerical procedures such as the Newton-Raphson method or iteratively reweighted least squares (IRLS), which converge to the MLE under standard conditions for generalized linear models.

For large samples, the MLE $\hat{\beta}$ possesses desirable asymptotic properties: it is consistent (converging in probability to the true $\beta$), asymptotically unbiased, and asymptotically normally distributed with covariance matrix given by the inverse of the observed Fisher information. In practice, contingency tables often contain empty cells (where $n_i = 0$), which can complicate estimation if they lead to non-convergence or infinite parameter estimates; common strategies include collapsing table dimensions to combine cells or, if theoretically justified, adding small constants (such as 0.5) to all cells prior to fitting, though the latter risks biasing results and is generally discouraged without strong rationale.
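The IRLS update for the Poisson log-linear model is simple enough to sketch directly in NumPy. The following is a minimal illustration rather than a production implementation; the design matrix encodes a hypothetical independence model for a 2 × 2 table:

```python
import numpy as np

def fit_poisson_loglinear(X, y, n_iter=25):
    """MLE of beta in log(mu) = X @ beta for Poisson counts y, via IRLS."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)                 # current expected frequencies
        z = X @ beta + (y - mu) / mu          # working response
        XtWX = X.T @ (mu[:, None] * X)        # weighted normal equations, W = diag(mu)
        XtWz = X.T @ (mu * z)
        beta = np.linalg.solve(XtWX, XtWz)
    return beta

# Hypothetical 2x2 table flattened to cells (0,0), (0,1), (1,0), (1,1);
# design columns: intercept, indicator for row 1, indicator for column 1.
y = np.array([25.0, 15.0, 10.0, 30.0])
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])

beta_hat = fit_poisson_loglinear(X, y)
print(np.exp(X @ beta_hat))   # fitted expected frequencies: 17.5, 22.5, 17.5, 22.5
```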

Hierarchical Principles

In log-linear analysis, the hierarchical principle governs model specification by requiring that any k-way interaction term included in the model must be accompanied by all lower-order interactions (up to (k-1)-way) among the same variables, as well as their constituent main effects. For example, including a three-way interaction term ABC necessitates the two-way terms AB, AC, and BC, along with the main effects A, B, and C. This constraint, rooted in the principle of marginality, ensures that higher-order effects are interpreted relative to the associations captured by lower-order terms, preventing the estimation of isolated partial interactions that lack substantive meaning.

The hierarchical structure embodies a Markov-like assumption, wherein the conditional independence of variables given the lower-order terms is implicitly modeled; higher-order interactions thus represent deviations from independence that are conditional on the specified margins. This property facilitates the decomposition of complex associations into interpretable components, aligning with the iterative nature of model building in categorical data analysis.

Hierarchical models are compactly specified using bracket notation, where terms denote the highest-order interactions, and lower-order terms are automatically included. For instance, the notation [AB][C] specifies a model with the two-way interaction between variables A and B, the main effect of C, and—by hierarchy—the main effects of A and B, corresponding to the equation

$$\log \mu_{ijk} = u + u_A^{(i)} + u_B^{(j)} + u_C^{(k)} + u_{AB}^{(ij)}.$$

Such notation streamlines the description of models ranging from complete independence ([A][B][C]) to partial associations.

Adopting the hierarchical principle offers key advantages, including reduced overfitting by limiting the parameter space to parsimonious, nested structures and enhanced interpretability through systematic progression from main effects to higher interactions. This approach supports forward or backward selection strategies, where models are compared hierarchically to identify significant associations without redundant terms.

A practical illustration occurs in analyzing a 2×2×2 contingency table, such as one cross-classifying gender (A), treatment (B), and outcome (C). The hierarchical model positing no three-way interaction but allowing all two-way interactions is expressed as

$$\log \mu_{ijk} = u + u_A^{(i)} + u_B^{(j)} + u_C^{(k)} + u_{AB}^{(ij)} + u_{AC}^{(ik)} + u_{BC}^{(jk)}.$$

This formulation tests for pairwise associations (e.g., treatment-outcome conditional on gender) while assuming uniformity across the third variable's levels, providing a baseline for assessing more complex dependencies.
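The bookkeeping behind bracket notation, in which each bracket implies all of its lower-order terms, can be illustrated with a short helper (the function name is hypothetical):

```python
from itertools import combinations

def expand_brackets(brackets):
    """Expand bracket notation into all implied (hierarchical) model terms.

    Each bracket contributes the term itself plus every lower-order subset,
    e.g. ['AB', 'C'] -> {'A', 'B', 'C', 'AB'}.
    """
    terms = set()
    for bracket in brackets:
        for order in range(1, len(bracket) + 1):
            for combo in combinations(bracket, order):
                terms.add("".join(combo))
    return sorted(terms, key=lambda t: (len(t), t))

print(expand_brackets(["AB", "C"]))           # ['A', 'B', 'C', 'AB']
print(expand_brackets(["AB", "AC", "BC"]))    # ['A', 'B', 'C', 'AB', 'AC', 'BC']
```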

Model Types

General Log-linear Models

General log-linear models provide a framework for analyzing associations among multiple categorical variables in contingency tables by modeling the logarithm of the expected cell frequencies as an additive function of parameters representing main effects and interactions. These models treat all variables symmetrically, without distinguishing between response and explanatory variables, and are fitted using maximum likelihood estimation, often via iterative methods like iterative proportional fitting. The approach is particularly suited for moderate-sized tables with a few variables, enabling tests of hypotheses about independence and conditional associations.

A foundational example is the mutual independence model, which assumes no associations among the variables. For a three-way contingency table with variables A, B, and C, this model is specified as $\log \mu_{ijk} = u + u_i^A + u_j^B + u_k^C$, where $\mu_{ijk}$ denotes the expected frequency in cell $(i,j,k)$, $u$ is the overall mean log-frequency, and the remaining $u$ terms capture the main effects. Under this model, the expected frequencies factorize as the product of the one-way marginal distributions, implying that the joint distribution is the product of the marginals. This model serves as a baseline for assessing higher-order dependencies.

Partial association models extend the mutual independence framework by incorporating selected interactions while assuming conditional independencies for others. For instance, the model [AB][AC] for three variables includes main effects for A, B, and C, along with two-way interactions AB and AC, but omits the BC interaction, yielding $\log \mu_{ijk} = u + u_i^A + u_j^B + u_k^C + u_{ij}^{AB} + u_{ik}^{AC}$. This specification tests the conditional independence of B and C given A, where the absence of the BC term implies that any association between B and C is explained by their mutual relations with A. Such models allow for flexible exploration of partial dependencies in multi-way tables.

The saturated model represents the most complex case, incorporating all possible main effects and interactions up to the highest order. For an $I \times J \times K$ table, it includes terms through the three-way ABC interaction, resulting in a parameter count equal to the number of cells ($IJK$), yielding zero degrees of freedom and a perfect fit to the observed data ($\hat{\mu}_{ijk} = n_{ijk}$ for all cells). While useful as a reference for model comparison, it provides no parsimony or insight into underlying structures.

Interpretation of parameters in general log-linear models focuses on their role in describing log-expected margins and associations. The main effect terms quantify deviations in the log-expected frequencies for specific levels of a variable from the grand mean, averaged over the other variables; for example, $u_1^A - u_2^A = \log(\hat{\mu}_{\cdot 1 \cdot}/\hat{\mu}_{\cdot 2 \cdot})$, the log ratio of marginal means. Interaction parameters capture multiplicative effects; in two-way cases, the $u_{ij}$ terms correspond to log-odds ratios measuring the strength of association between two variables, conditional on others in higher dimensions. These interpretations facilitate understanding of how variables jointly influence cell frequencies.

Despite their flexibility, general log-linear models face significant limitations with increasing dimensionality. The number of potential parameters expands exponentially with the number of variables (e.g., $2^p$ interactions for $p$ binary variables), exacerbating the curse of dimensionality: tables become increasingly sparse, estimation becomes computationally intensive due to the need for iterative algorithms over large parameter spaces, and models risk overfitting without strong prior constraints on interactions. These challenges restrict practical application to tables with no more than four or five variables unless specialized structures are imposed.
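A minimal sketch comparing these three model types on a hypothetical 2 × 2 × 2 table, assuming pandas and statsmodels are available; the deviance (G²) and residual degrees of freedom shrink to zero as the model approaches saturation:

```python
import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame(list(itertools.product("12", "12", "12")), columns=["A", "B", "C"])
df["count"] = rng.integers(5, 60, size=len(df))     # hypothetical counts

models = {
    "[A][B][C] (mutual independence)": "count ~ C(A) + C(B) + C(C)",
    "[AB][AC] (B and C independent given A)":
        "count ~ C(A) + C(B) + C(C) + C(A):C(B) + C(A):C(C)",
    "[ABC] (saturated)": "count ~ C(A) * C(B) * C(C)",
}
for name, formula in models.items():
    fit = smf.glm(formula, data=df, family=sm.families.Poisson()).fit()
    print(f"{name}: G2 = {fit.deviance:.2f}, residual df = {fit.df_resid:.0f}")
```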

Graphical Models

Graphical log-linear models represent conditional independencies and interactions among categorical variables in multidimensional contingency tables using undirected graphs. These models facilitate the specification of log-linear models by visualizing the structure of associations, where the absence of certain interactions is explicitly encoded through the graph's topology. In a graphical log-linear model, nodes correspond to the categorical variables (or factors), and undirected edges between nodes denote direct associations, typically interpreted as two-factor interactions. The presence of an edge indicates that the variables are directly related, while the absence suggests potential conditional independence given other variables. Higher-order interactions are implied only if necessary to maintain the graph's structure.

The graph encodes conditional independencies via the separation criterion for undirected graphs: two sets of variables are conditionally independent given a third set if the third set separates them in the graph, meaning there is no path connecting the two sets that avoids the separating nodes. This separation implies that no higher-order interactions beyond those captured by the graph are required in the model. For example, if variables B and C are separated by A, then B is conditionally independent of C given A (B ⊥ C | A), precluding a three-way interaction term [ABC].

The log-linear model is generated by including terms that correspond to the maximal cliques—complete subgraphs where every pair of nodes is connected by an edge—in the graph. Each maximal clique generates an interaction term of the order equal to the number of variables in it, and the full set of such terms defines the model. Marginal terms for smaller cliques are automatically included to ensure the model is hierarchical.

Consider three binary variables A, B, and C forming a graph with edges AB and AC but no edge BC. The maximal cliques are {A, B} and {A, C}, leading to the model [AB][AC]. This specifies the log-expected cell frequencies as

$$\log m_{abc} = \mu + \lambda^A_a + \lambda^B_b + \lambda^C_c + \lambda^{AB}_{ab} + \lambda^{AC}_{ac},$$

implying B ⊥ C | A.

Graphical log-linear models offer advantages as a visual aid for discerning complex interdependencies among multiple variables, making it easier to hypothesize and test structures in high-dimensional data. They also connect to broader probabilistic modeling paradigms, such as Bayesian networks, by sharing foundational Markov properties that link graphical separation to statistical independence. These models can be fitted using maximum likelihood estimation, often via iterative proportional fitting algorithms.
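Reading the model generators off the maximal cliques can be sketched with the networkx library (an assumed dependency, not mentioned in the article):

```python
import networkx as nx   # assumed dependency for representing the dependence graph

g = nx.Graph()
g.add_edges_from([("A", "B"), ("A", "C")])   # edges AB and AC, no BC edge

# The maximal cliques give the generators of the graphical log-linear model.
cliques = sorted(sorted(c) for c in nx.find_cliques(g))
print(cliques)   # [['A', 'B'], ['A', 'C']] -> model [AB][AC], so B is independent of C given A
```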

Decomposable Models

Decomposable log-linear models form a special subclass of graphical log-linear models, characterized by an underlying dependence graph that is chordal—meaning it contains no induced cycles of length greater than three—and admits a simplicial ordering of its cliques, allowing recursive decomposition into independent components separated by complete cliques. This structure ensures that the model factorizes exactly over the maximal cliques $C$ and separators $S$, facilitating efficient computation without approximation. A key advantage of decomposability is the availability of closed-form maximum likelihood estimates (MLEs), derived directly from the observed marginal tables corresponding to the cliques, with adjustments for the separators to account for overlaps; this eliminates the need for iterative optimization, making estimation computationally tractable even in higher dimensions. The iterative proportional fitting (IPF) algorithm provides an alternative fitting method for these models, operating by cyclically scaling the current estimate to match the observed margins for each clique until convergence, which occurs in a finite number of steps for decomposable cases.

The expected cell frequencies $\mu$ in a decomposable model satisfy the factorization

$$\mu = \prod_{C} \mu_{C}^{v_{C}} \Big/ \prod_{S} \mu_{S}^{v_{S}},$$

where $C$ denotes the cliques, $S$ the separators, and $v$ the respective multiplicities reflecting how often each component appears in the decomposition. These models are particularly suited to sparse, high-dimensional contingency tables, such as those arising in genetics for analyzing multi-locus associations or in social surveys for multi-way interactions among categorical variables, where the chordal structure exploits sparsity to enable exact inference.
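For the model [AB][AC] of the previous section, the clique/separator factorization gives the MLE in closed form. A minimal NumPy sketch with a hypothetical 2 × 2 × 2 table:

```python
import numpy as np

# Closed-form MLE for the decomposable model [AB][AC]
# (cliques {A,B} and {A,C}, separator {A}):
#   mu_abc = n_{ab+} * n_{a+c} / n_{a++}
n = np.array([[[20., 10.], [15., 25.]],
              [[30., 35.], [10., 40.]]])         # hypothetical counts n[a, b, c]

n_ab = n.sum(axis=2)          # clique {A, B} margin, shape (2, 2)
n_ac = n.sum(axis=1)          # clique {A, C} margin, shape (2, 2)
n_a = n.sum(axis=(1, 2))      # separator {A} margin, shape (2,)

mu = n_ab[:, :, None] * n_ac[:, None, :] / n_a[:, None, None]

# The fitted table reproduces the clique margins exactly, as the theory requires.
print(np.allclose(mu.sum(axis=2), n_ab), np.allclose(mu.sum(axis=1), n_ac))
```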

Model Evaluation

Assessing Model Fit

Assessing the fit of a log-linear model to contingency table data primarily involves goodness-of-fit tests that measure discrepancies between observed cell counts $n_i$ and model-expected counts $\mu_i$, where the latter are derived from maximum likelihood estimation under the Poisson assumption. The Pearson chi-square statistic,

$$X^2 = \sum_i \frac{(n_i - \mu_i)^2}{\mu_i},$$

quantifies this discrepancy and asymptotically follows a chi-squared distribution when expected counts are sufficiently large (typically at least 5 per cell). Similarly, the likelihood ratio statistic,

$$G^2 = 2 \sum_i n_i \log(n_i / \mu_i),$$

provides an alternative measure that is often preferred for its additive property, allowing straightforward decomposition for nested models, and it also approximates a chi-squared distribution under the null hypothesis of adequate fit. Both statistics test the model against the saturated model, with small values (or large p-values) indicating good reproduction of the observed data.

The degrees of freedom for these chi-squared tests are calculated as the total number of cells in the table minus the number of free parameters estimated in the model; in Poisson log-linear models, this subtracts the parameters defining the log-expected values, including the overall mean level. For instance, in a two-way table with $I$ rows and $J$ columns under a model of independence, the degrees of freedom equal $(I-1)(J-1)$. These degrees of freedom ensure the test accounts for model complexity, providing a benchmark for significance.

When overall fit statistics suggest inadequacy, residual analysis helps pinpoint poorly fitted cells. Standardized Pearson residuals, $(n_i - \mu_i)/\sqrt{\mu_i}$, identify the specific cells where the model departs from the data; absolute values much larger than about 2 suggest a lack of fit in those cells.
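A minimal sketch computing these quantities for hypothetical observed counts and fitted expected counts, assuming NumPy and SciPy (the parameter count is illustrative):

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([28.0, 22.0, 17.0, 33.0])     # hypothetical cell counts
expected = np.array([25.0, 25.0, 20.0, 30.0])     # hypothetical fitted values
n_params = 3                                       # parameters in the fitted model

X2 = np.sum((observed - expected) ** 2 / expected)            # Pearson chi-square
G2 = 2.0 * np.sum(observed * np.log(observed / expected))     # likelihood ratio
df = observed.size - n_params
pearson_residuals = (observed - expected) / np.sqrt(expected)

print(X2, G2, df, chi2.sf(G2, df))
print(pearson_residuals)   # cells with |residual| much larger than ~2 fit poorly
```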